Advanced - Time Series

Menu

Lesson 10: Managing Time Series Data Lifecycle

CrateDB Academy: Time Series Video

 

Video Transcript

Managing the lifecycle of data is crucial for maintaining performance and controlling storage costs, especially when dealing with time series data. With CrateDB, you can design a flexible approach to handle data retention. Policies can be configured to manage data based on age, size, or other criteria. This ensures that your database remains optimized and cost-effective. 

One of the features that makes CrateDB robust for handling large datasets is its support for table partitioning. Let's take a look at the code snippet. We're creating a table called t1 with two columns: name and month. As described in an earlier session, we are using the PARTITIONED BY (month) clause to split the data into separate partitions for each month. This partitioning strategy is particularly useful for time-series data, where you can query and manage data more efficiently by working with smaller, more relevant subsets. 

When we insert data into t1, as shown in the INSERT INTO statement, new partitions are automatically created for new months on the first data ingest. Moreover, when it's time to remove old data, as demonstrated by the DELETE FROM statement, we can drop entire partitions without the costly process of index rebuilds. This makes it very easy to manage historical data and ensures that our database remains fast and efficient. 

By partitioning data, you can strategically remove outdated or irrelevant data without affecting the rest of the dataset. This helps in reducing storage costs since you only retain the data that is necessary for your business needs. Now let’s take a look at how to define a retention policy to automate the process of purging old data. 

CrateDB can operate on a cluster with a mix of nodes that have different storage types, such as hot (for frequent access) and cold storage (for less frequently accessed data). To define a node’s storage type within the cluster you can use node.attr.storage attribute for each node. Retention policies in general requires the use of partitioned tables which facilitate faster data pruning and management. In this example, the retention_policies table illustrates how to define retention policies that will be later applied in the CrateDB cluster by orchestration or automation tooling. Inside this table we keep the information about:   

  • table_schema, which is the schema name of the table (typically doc for the default schema)
  • table_name, denoting the name of the table to which the retention policy will apply
  • partition_column or the column used to partition the table. In our example, it is the month retention period which specifies the period for which data should be retained
  • The strategy column allows for multiple policies once the data exceeds the retention period. One policy can be the reallocation of data to another node or a complete data deletion.  

The INSERT statement shows a policy for the t1 table in the default doc schema, where data is partitioned by month, and data older than 3 months will be subject to the reallocate strategy. This means that data will be moved after 3 months within the cluster, typically from hot to cold storage. In the next examples, we will see how these retention policies can be implemented. 

The first statement shows how to configure a partition on hot nodes and later move it to cold storage. As the example suggests, the created table is partitioned by the month column and the WITH clause specifies that new partitions should be stored on nodes with hot storage. Hot storage is typically used for data that is accessed frequently and requires fast read/write operations. 

To relocate affected partitions to cold nodes one should use the ALTER TABLE statement. The partition clause identifies the partition to be altered based on the month value. Finally, the SET clause updates the routing for the specified partition, directing it to nodes with 'cold' storage using the setting routing.allocation.require.storage = 'cold'. Cold storage is used for less frequently accessed data, which can be stored more cost-effectively. CrateDB’s cluster automatically initiates the relocation of the affected partitions once the settings are altered. Partitions are seamlessly moved to a node that fulfills the specified requirements. 

Optionally, once the data has reached the end of its retention period and is no longer needed the DELETE FROM statement is used to remove data from the table. The combination of partitioned tables and retention strategies in CrateDB enables efficient data lifecycle management. By assigning partitions to hot or cold nodes and setting up automatic deletion, CrateDB users can optimize their cluster's storage and performance. 

Backup snapshots can be used to archive old partitions. Snapshots capture the state of a table or a series of tables at the precise moment it is created. Each partition can be backed up and restored individually, if needed, and therefore enables another instrument how to handle historical data that must be retained, but does not need to be available in the hot or warm database. Snapshots are stored in so-called repositories, which function as storage containers for them. 

A repository is configured with a specific storage backend, which could be Amazon S3, Microsoft Azure Blob Storage, Google Cloud Storage, or a local filesystem. The selection of the backend dictates the method and location of snapshot storage. For instance, when utilizing S3, the snapshots are preserved as objects within an S3 bucket. 

To establish a repository for storing snapshots in an S3 backend, one would employ the CREATE REPOSITORY statement in CrateDB. In our given example, we create a repository named export_cold, and we utilize S3 as the storage backend. The WITH clause encompasses all the requisite configuration details such as protocol, endpoint, access credentials, and the name of the bucket. 

Upon the creation of the repository, you can proceed to create snapshots. A snapshot is generated with the CREATE SNAPSHOT statement, which stipulates which tables and partitions to include. This statement can also incorporate options such as whether to await the completion of snapshot creation before returning control to the user. In our example, we create a snapshot for a specific partition of the table t1. Only the data from the partition where the column date has the value 2023-01-01 will be included in the snapshot. The option wait_for_completion=true indicates that the statement will not return until the snapshot is fully created. 

Restoring snapshots reverts the state of your tables or entire cluster to a previous point in time. This can be critical for recovery from data loss or corruption, or in our example to restore archived data in case you need to analyze it again. 

Here is how you can restore data in CrateDB: 

  1. Restore a Single Table Partition: If you need to restore just a single partition of a table, you can use the RESTORE SNAPSHOT command with the PARTITION clause. Our example shows how to restore the partition for January 2023 from a snapshot named snapshot1 in the repository export_cold.
  2. Restore an Entire Table: To restore an entire table, omit the PARTITION clause. The second example with restore all partitions from table t1.
  3. Restore the Entire Snapshot: If you want to restore everything that the snapshot contains, use the ALL keyword as illustrated in the third example. This restores all the tables, partitions, and possibly other cluster metadata contained in the snapshot 

To ensure a smooth restoration process in CrateDB, if you're restoring a complete table, it should have a unique name within the cluster. Utilize renaming strategies or schema replacement options during the restore process for any tables that need to be uniquely identified. This approach helps to maintain a consistent database structure and avoid any conflicts with existing tables.  

If you want to restore all tables from the snapshot without the metadata, you can use the TABLES keyword. In addition to table data, you can also restore cluster metadata, which includes views, cluster settings, and other configurations stored in the snapshot.  

When you no longer need a snapshot, or if you wish to delete the repository, you can use the DROP SNAPSHOT or DROP REPOSITORY command, respectively. This helps in managing the storage space and keeping the repository organized by removing obsolete snapshots and repositories. 

Remember that managing snapshots and repositories should be part of your regular database maintenance and disaster recovery planning. Always ensure that you have recent and valid backups and test your restore procedures regularly to ensure they work as expected.