Advanced - Time Series

Menu

Lesson 8: Sharding and Partitioning for Time Series

CrateDB Academy: Time Series Video

 

Video Transcript

Sharding and partitioning are essential aspects of storing time series data. Partitioning allows data to be subdivided into smaller, more manageable pieces, enhancing read and write performance. Sharding is a strategy used to distribute the data horizontally across multiple nodes, thereby improving the speed of queries and updates. 

Partitioning in CrateDB is a powerful feature that allows you to split up a large table into smaller chunks, or partitions. By doing so, CrateDB minimizes the number of records that need to be scanned during a query, which can significantly improve performance, especially for large datasets. 

For time series data you usually use a timestamp or date column to define partition criteria. For example, you can partition data by month, as shown in the provided SQL statement for creating the weather_data table. In this table, the month column is a generated column that uses the date_trunc function to extract the month from the timestamp column. The resulting table is partitioned by the month column, which means that each partition will contain records for each individual month only. For quarterly, daily or hourly partitioning you would need to adjust the generated column expression to truncate the timestamp to a specific interval. 

New partitions are automatically created on the first data ingest that falls into a new partition interval. You do not need to manually manage the creation of partitions as your data grows. Instead, CrateDB handles this for you, creating new partitions whenever they are needed based on the partitioning strategy you have defined. 

Partitions in CrateDB offer a level of granularity that allows for individual management and maintenance. This means that each partition can be backed up, closed, deleted, or archived individually. This flexibility provides efficient data management, particularly important in handling large time series data. For example, if a specific partition contains outdated or irrelevant data, it can be removed without impacting the rest of the database. Similarly, individual backups allow for specific data recovery, while archiving can help optimize storage and improve overall performance. 

Sharding in CrateDB is a fundamental concept that allows for the horizontal scaling of data storage and distributed query processing. When you create a table in CrateDB, the data is not just stored in a single monolithic structure; instead, it is split into shards, which are distributed across the nodes in your cluster. 

For instance, consider the SQL statement for creating a weather_data table. The CLUSTERED INTO 3 SHARDS clause tells CrateDB to divide each partition into three shards. Each partition’s data will be split across the shards, and each shard will hold roughly one-third of the data. The number of shards is configurable, and you can decide how many shards to create based on the size of your dataset, the number of nodes in your cluster, and the amount of CPUs per node. As the data volume can grow over time, the number of shards per partition can be changed over time and therefore allowing to create more shards for more recent partitions. This is particularly useful when onboarding, for example, additional devices or weather stations. Another case might be gathering measurements in a higher frequency and therefore generating more data than initially planned. Increasing the number of shards might be necessary to ensure consistent performance. 

Sharding, combined with partitioning, provides two levels of data organization. In the provided SQL statement, the weather data table is not only partitioned by month, but it is also separated into three shards. This setup allows CrateDB to perform operations in parallel across different shards and partitions, which can significantly improve query performance, especially for large time-series datasets and concurrent queries. The performance can be increased by sharding on single-node instances as well, as queries and aggregations can be executed in parallel and existing hardware is utilized in the best way. 

The number of replicas in your CrateDB cluster is a crucial factor that depends on the availability of  Service Level Agreements of your application. It's recommended to have at least one replica for each shard, but two replicas would provide a better fault tolerance. CrateDB ensures that primary and replica shards are automatically distributed across nodes in the cluster, enhancing data availability and durability. In the event of node failures, CrateDB automatically promotes replica shards to primary shards, ensuring continuous access to your data. This feature also allows for rolling maintenance operations without causing downtime, as the database remains available through the replicas even when some nodes are temporarily offline for maintenance.  

Replica shards are also used for querying and aggregating data and therefore enhance the overall query performance. 

When implementing sharding and partitioning strategies for timeseries data in CrateDB, it's important to consider several best practices to optimize performance, manageability, and scalability.

To ensure optimal performance and resource utilization, you should consider the total number of shards.

This is determined by the number of partitions multiplied by the number of shards per partition multiplied by the number of replicas.

It's important to calculate this number carefully to balance query performance with resource overhead.

For example, with monthly partitioning, 3 shards per partition and 1 replica, you would have a total of 72 shards. 12 months times 3 shards, times 2 replicas.

Shard size is another crucial factor. The ideal size for each shard is between approximately 3 and 70 gigabytes.

Shards that are too small may result in unnecessary overhead, while shards that are too large may lead to performance bottlenecks due to limited parallelization of operations.

Also, each shard in CrateDB is stored as an individual Lucene index that consists of multiple segments and has a memory footprint.

To prevent excessive resource consumption, it's recommended that each node in your cluster should not host more than 1000 shards.

This helps to maintain the stability and responsiveness of the cluster.
 

Choosing the correct partitioning strategy means finding a balance between the granularity of partitions and the overall number of partitions, which impacts the cluster's performance and manageability. Some factors to be considered are: 

  1. The Amount of Time-Series Data: Consider the volume of data you expect to ingest into your CrateDB cluster. Larger datasets may benefit from more granular partitioning to prevent any single partition from becoming too large, which could slow down query performance.
  2. The Retention Period, that means the period of time you need to keep the data affects the number of partitions. For longer retention periods, you may need more partitions to keep the dataset manageable.
  3. The size of your CrateDB cluster, that means the number of nodes influences your partitioning strategy. More nodes can handle a larger number of partitions, distributing the load and allowing for parallel processing. However, too many partitions across too few nodes can lead to excessive overhead. 

Let’s consider an example with a three node CrateDB cluster. If you estimate to collect 700 megabytes of data each day, this translates to almost 5 gigabytes of data per week, 21 gigabytes of data per month and 253 gigabytes of data per year. Given these parameters and one year retention period you could choose to partition your data yearly. For more than one terabyte of data one should consider monthly partitions. 

The ideal partitioning strategy for time series data in CrateDB is not static; it should evolve as your data volume grows and your requirements change. The potential deletion or archiving of data also influences the partitioning strategy. For example, if you want to delete large chunks of data it is beneficial to partition data accordingly and drop, for example, monthly partitions instead of executing delete statements. 

When configuring sharding for time series data, it is important to consider several factors such as the expected amount of data, number of partitions and cluster configuration including number of nodes and the number of CPUs per node. A good rule of thumb is to have as many shards as there are CPUs in the cluster. This increases the chance to get a maximum of distribution and parallelization of queries. In our example, let‘s consider a three-node cluster where each node has three CPUs. The total amount of data over a course of the year is predicted to be around 253 gigabytes and we decided to have one partition. In this situation, we can choose to have nine shards, resulting in three shards per node, where each shard contains 28 gigabytes of data. It is worth mentioning that it is important to monitor the performance of your cluster as data grows and adjust the number of shards as needed to ensure optimal performance over time. 

CrateDB also supports high availability, which is crucial for maintaining uninterrupted service and preventing data loss. In CrateDB, replication is configurable on a per-table basis. This gives you the flexibility to set the level of replication that best suits your data size and workload. In our example, one replica is created per shard. This ensures that a copy of your data is always available, enhancing the reliability of your database and safeguarding your data against potential failures.