Time series data often presents unique challenges due to its large volume and temporal nature. Therefore, optimizing queries becomes crucial for maintaining performance and scalability. In this video, we'll explore essential techniques for efficiently querying time series data in CrateDB.
To find bottlenecks and potential optimization opportunities in your time-series queries consider using the EXPLAIN ANALYZE
command. When you prepend EXPLAIN ANALYZE
to your SQL query, CrateDB not only generates the execution plan, but also runs the query and captures detailed runtime statistics. This includes the time spent in various operations such as parsing, planning, and the actual execution of the query. For CrateDB, which relies on Lucene indexes, EXPLAIN ANALYZE
can offer insights into how queries are executed using Lucene's indexing mechanisms.
In this example, we illustrate the time taken to run a SELECT
query where certain conditions are applied, filtering data based on temperature. The time number represents the query execution time in milliseconds, while the green labels represent execution times for individual operations within the overall query, such as data scans, joins, or aggregations.
These granular timings provide insight into the database's performance, helping us understand where optimizations can be made. For instance, if a particular operation is taking longer than expected, we might consider indexing strategies or query rewriting to enhance performance.
A full table scan, where the database engine examines every single row in a table, can lead to performance issues. It's a process that is both CPU and I/O intensive and can lead to inefficient execution when dealing with large, time-series datasets.
For instance, let's consider a query aimed at calculating the average temperature over a two-month period.
Executing it on a non-sharded and non-portioned table requires a full table scan.
Using partitioning and sharding avoids that all data needs to be scanned on the one hand and the aggregation can be executed in parallel on the different shards and partitions.
Let’s look at this example in practice. We first need to create a new table and import weather data:
CREATE TABLE IF NOT EXISTS "doc"."weather_data" (
"timestamp" TIMESTAMP WITHOUT TIME ZONE,
"location" TEXT,
"temperature" DOUBLE PRECISION,
"humidity" DOUBLE PRECISION,
"wind_speed" DOUBLE PRECISION
) CLUSTERED INTO 1 SHARDS;
COPY weather_data
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_weather.csv.gz'
WITH (format='csv', compression='gzip', empty_string_as_null=true);
Now, let’s create another table as a partitioned table based on time intervals, such as monthly partitions. By doing this, we enable the CrateDB to quickly return just the relevant data needed for the query, bypassing all irrelevant records.
CREATE TABLE IF NOT EXISTS "doc"."weather_data_partitioned" (
"timestamp" TIMESTAMP WITHOUT TIME ZONE,
"location" TEXT,
"temperature" DOUBLE PRECISION,
"humidity" DOUBLE PRECISION,
"wind_speed" DOUBLE PRECISION,
"ts_month" GENERATED ALWAYS AS date_trunc('month', "timestamp")
) CLUSTERED INTO 1 SHARDS
PARTITIONED BY ("ts_month");
COPY weather_data_partitioned
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_weather.csv.gz'
WITH (format='csv', compression='gzip', empty_string_as_null=true);
Let’s run the query on both, partitioned and non-partitioned tables and use the EXPLAIN ANALYZE
command to better understand which partitions have been queried.
The execution trace of the query on the non-partitioned table looks like the following:
EXPLAIN ANALYZE SELECT
date_trunc('month', "timestamp") AS "month",
AVG("temperature") AS "avg_temperature",
AVG("humidity") AS "avg_humidity",
AVG("wind_speed") AS "avg_wind_speed"
FROM "doc"."weather_data"
WHERE "timestamp" >= '2023-01-01' AND "timestamp" < '2023-03-01'
GROUP BY "month"
ORDER BY "month";
Here we can see that one partition was queried. We can run the same query on partitioned table:
EXPLAIN ANALYZE SELECT
ts_month,
AVG("temperature") AS "avg_temperature",
AVG("humidity") AS "avg_humidity",
AVG("wind_speed") AS "avg_wind_speed"
FROM "doc"."weather_data_partitioned"
WHERE "timestamp" >= '2023-01-01' AND "timestamp" < '2023-03-01'
GROUP BY "ts_month"
ORDER BY "ts_month";
Now as we inspect the query plan, we can see that the second query runs on 3 partitions and the total execution time was reduced significantly.
A correct partitioning approach streamlines query execution and significantly reduces the time and resources required to fetch the desired data.
With the next example we are going to examine the impact of utilizing indexes on query performance in CrateDB. The query returns the average temperature for each month and considers only temperature records higher than 20 and lower than -20 degrees. To examine the execution phases of this query we run it with EXPLAIN ANALYZE
.
The query employs the abs()
function in the WHERE
clause. During the query execution, abs()
triggers a generic function scan, it doesn't directly benefit from the indexes on the temperature column which leads to a full-table scan. It means that the abs function requires evaluating the absolute value of the temperature of each single record which impacts the overall query performance.
Changing the query to use direct comparison operators allows to use indexes. By running EXPLAIN ANALYZE
we can observe that CrateDB now runs multiple queries which utilize underlying indexes to quickly locate and retrieve the relevant temperature records.
This is one example where the right choice of predicates in your SQL queries can have a significant impact on performance. The execution trace shows you if a generic function query is executed of an index can be leveraged without evaluating each individual record. This has been a very simple example, please check the execution trace with your own queries. Similar effects can be achieved, for example, when sorting data. Using an index for sorting is highly beneficial over sorting in memory.
In CrateDB, an array is a data type that allows you to store a sequence of elements of the same type. Because they can handle multiple values in a single, ordered structure, they are particularly useful for batch processing and the efficient representation of data sets. In the context of CrateDB, which is optimized for handling large-scale and time-series data, arrays can be used to store sequences of measurements which can significantly reduce the overall storage footprint. Let’s take a look at an example:
The CREATE TABLE
statement for sensor_readings
in CrateDB is set up to store each sensor reading as an individual record. It includes a timestamp with a time zone for when the reading was taken, a text field for the sensor ID, a double-precision floating-point number for the battery level, and a text field for the battery status. This typical setup is suitable for scenarios where the frequency of data insertions is relatively low, or the storage and performance costs are not a primary concern. However, it can become inefficient when dealing with very large datasets, as each sensor reading is stored as a separate row, which can lead to a higher storage footprint and potentially slower query performance due to the higher number of rows to process.
To tackle the challenge of efficiently storing a massive dataset, containing 120 million records which occupy 6 GiB of space we applied several optimizations:
We restructured time data to be modeled as an array.
The time_bucket
field is truncated to the day level.
Similarly, the actual sensor measurements are stored in arrays. This means that instead of one row per measurement, you have one row per time bucket.
A more aggressive compression algorithm (e.g., best_compression
) is applied to the table, taking advantage of the array structures.
Indexes on the array fields are turned off to reduce the storage space used by indexes.
By implementing these changes, the storage requirement was dramatically reduced from 6 GiB to 1.1 GiB, achieving an 80% reduction in storage space. This example illustrates how to leverage the array data type to reduce storage requirements and potentially improve query performance for time-series use cases.
Another vital element of CrateDB's comprehensive feature set is the Common Table Expression, also known as CTE. CTEs serve as a potent instrument for enhancing the performance and readability of queries within CrateDB. In certain scenarios, such as those involving intricate queries that necessitate numerous filtering and aggregation steps, or when there's a need to repeatedly use a data subset within a single query, CTEs can provide performance advantages over traditional JOIN operations.
Suppose we have three tables: the sales
table records each sale transaction, the sale_items
table details items within each sale and the product
table which maintains information on each product. Now we want to find the total quantity sold for a specific product within a given date range. To calculate the total quantity of the product named Gadget sold within the date range from January 1, 2024, to February 1, 2024 the query joins sales
, sale_items
, and products
tables based on their respective IDs. Even though this is a perfectly valid SQL it may not be the most efficient strategy.
Even without considering the complexities of a distributed system, parallel processing, and disk/memory options there are still many different possible strategies here, for instance: “Gadget” product may be sold rarely, we could then start by looking up its product_id
, then locate all instances of its sales in the sale_items
table, and finally narrow down those sales to the specific date range in January 2024. If "Gadget" is a popular product in a dataset covering many years of sales, starting the query by selecting all sales made in January 2024 from the sales table, then identifying which of these sales include "Gadget" could be a more effective strategy.
The impact of using a suboptimal execution plan here could be significant, as we could find ourselves trying to JOIN
millions and millions of records.
In our example, we consider the specific use case where our product 'Gadget' is a niche item with infrequent sales. The modified query we have constructed using Common Table Expressions efficiently isolates the relevant product ID first, reducing the subsequent search space to just the sales of this specific item before checking the corresponding sales records for the selected date range.
In the relevant_product_ids
CTE, we limit our initial search to the minimal set of relevant product entries, avoiding a full scan of the potentially large sale_items
table.
We then focus on the sale transactions associated with these product_ids
, which is expected to be a small subset given the product's rarity, thus speeding up the process.
Finally, by filtering the sales data for these specific sale_items
, we ensure that we only consider the sales transactions that happened within our desired date range, January 2024.
This stepwise refinement ensures that each subsequent join operation deals with the smallest possible dataset, leading to faster query performance and more efficient use of resources.
Overall, CTEs make the query more readable and easier to maintain, especially for complex queries involving multiple steps. Furthermore, CTEs can keep the intermediate results, which can be beneficial if the same subset of data is used multiple times in the query, reducing the number of scans over the original table.
Mastering time series query optimization in CrateDB is essential for ensuring efficient data processing. By integrating the strategies highlighted in this video, you'll be well-positioned to achieve optimal performance for your time-series analytics in CrateDB.