Sharding and partitioning are essential aspects of storing time series data. Partitioning allows data to be subdivided into smaller, more manageable pieces, enhancing read and write performance. Sharding is a strategy used to distribute the data horizontally across multiple nodes, thereby improving the speed of queries and updates.
Partitioning in CrateDB is a powerful feature that allows you to split up a large table into smaller chunks, or partitions. By doing so, CrateDB minimizes the number of records that need to be scanned during a query, which can significantly improve performance, especially for large datasets.
For time series data you usually use a timestamp or date column to define partition criteria. For example, you can partition data by month, as shown in the provided SQL statement for creating the weather_data
table. In this table, the month column is a generated column that uses the date_trunc
function to extract the month from the timestamp column. The resulting table is partitioned by the month column, which means that each partition will contain records for each individual month only. For quarterly, daily or hourly partitioning you would need to adjust the generated column expression to truncate the timestamp to a specific interval.
New partitions are automatically created on the first data ingest that falls into a new partition interval. You do not need to manually manage the creation of partitions as your data grows. Instead, CrateDB handles this for you, creating new partitions whenever they are needed based on the partitioning strategy you have defined.
Partitions in CrateDB offer a level of granularity that allows for individual management and maintenance. This means that each partition can be backed up, closed, deleted, or archived individually. This flexibility provides efficient data management, particularly important in handling large time series data. For example, if a specific partition contains outdated or irrelevant data, it can be removed without impacting the rest of the database. Similarly, individual backups allow for specific data recovery, while archiving can help optimize storage and improve overall performance.
Sharding in CrateDB is a fundamental concept that allows for the horizontal scaling of data storage and distributed query processing. When you create a table in CrateDB, the data is not just stored in a single monolithic structure; instead, it is split into shards, which are distributed across the nodes in your cluster.
For instance, consider the SQL statement for creating a weather_data
table. The CLUSTERED INTO 3 SHARDS
clause tells CrateDB to divide each partition into three shards. Each partition’s data will be split across the shards, and each shard will hold roughly one-third of the data. The number of shards is configurable, and you can decide how many shards to create based on the size of your dataset, the number of nodes in your cluster, and the amount of CPUs per node. As the data volume can grow over time, the number of shards per partition can be changed over time and therefore allowing to create more shards for more recent partitions. This is particularly useful when onboarding, for example, additional devices or weather stations. Another case might be gathering measurements in a higher frequency and therefore generating more data than initially planned. Increasing the number of shards might be necessary to ensure consistent performance.
Sharding, combined with partitioning, provides two levels of data organization. In the provided SQL statement, the weather data table is not only partitioned by month, but it is also separated into three shards. This setup allows CrateDB to perform operations in parallel across different shards and partitions, which can significantly improve query performance, especially for large time-series datasets and concurrent queries. The performance can be increased by sharding on single-node instances as well, as queries and aggregations can be executed in parallel and existing hardware is utilized in the best way.
The number of replicas in your CrateDB cluster is a crucial factor that depends on the availability of Service Level Agreements of your application. It's recommended to have at least one replica for each shard, but two replicas would provide a better fault tolerance. CrateDB ensures that primary and replica shards are automatically distributed across nodes in the cluster, enhancing data availability and durability. In the event of node failures, CrateDB automatically promotes replica shards to primary shards, ensuring continuous access to your data. This feature also allows for rolling maintenance operations without causing downtime, as the database remains available through the replicas even when some nodes are temporarily offline for maintenance.
Replica shards are also used for querying and aggregating data and therefore enhance the overall query performance.
Choosing the correct partitioning strategy means finding a balance between the granularity of partitions and the overall number of partitions, which impacts the cluster's performance and manageability. Some factors to be considered are:
Let’s consider an example with a three node CrateDB cluster. If you estimate to collect 700 megabytes of data each day, this translates to almost 5 gigabytes of data per week, 21 gigabytes of data per month and 253 gigabytes of data per year. Given these parameters and one year retention period you could choose to partition your data yearly. For more than one terabyte of data one should consider monthly partitions.
The ideal partitioning strategy for time series data in CrateDB is not static; it should evolve as your data volume grows and your requirements change. The potential deletion or archiving of data also influences the partitioning strategy. For example, if you want to delete large chunks of data it is beneficial to partition data accordingly and drop, for example, monthly partitions instead of executing delete statements.
When configuring sharding for time series data, it is important to consider several factors such as the expected amount of data, number of partitions and cluster configuration including number of nodes and the number of CPUs per node. A good rule of thumb is to have as many shards as there are CPUs in the cluster. This increases the chance to get a maximum of distribution and parallelization of queries. In our example, let‘s consider a three-node cluster where each node has three CPUs. The total amount of data over a course of the year is predicted to be around 253 gigabytes and we decided to have one partition. In this situation, we can choose to have nine shards, resulting in three shards per node, where each shard contains 28 gigabytes of data. It is worth mentioning that it is important to monitor the performance of your cluster as data grows and adjust the number of shards as needed to ensure optimal performance over time.
CrateDB also supports high availability, which is crucial for maintaining uninterrupted service and preventing data loss. In CrateDB, replication is configurable on a per-table basis. This gives you the flexibility to set the level of replication that best suits your data size and workload. In our example, one replica is created per shard. This ensures that a copy of your data is always available, enhancing the reliability of your database and safeguarding your data against potential failures.