Advanced - Time Series

Menu

Lesson 2: Time Series Data in CrateDB

CrateDB Academy: Time Series Video

 

Video Transcript

Let's dive into how CrateDB handles time series. We will show and explain how CrateDB helps to query data in milliseconds, leverage the simplicity of SQL, combine multiple data types, handle high ingest rates, and store years of historical data. 

Weather data will be frequently used in our examples as it is easy to comprehend and available from public sources. As necessary, we will incorporate other data sets to best exhibit the extensive capabilities of CrateDB in managing and analyzing time series data. For most of the examples, we use rather small data sets for demonstration purposes. In later sessions, we will also cover how to work with larger than memory data sets. 

In this video, we will create a data table and populate it with weather readings from locations across the globe. Each record includes the exact time of the recording, the location of the weather station, the temperature in degrees Celsius, the relative humidity, and the wind speed in kilometres per hour. We will then showcase the power of CrateDB's querying capabilities, from basic data extraction to more complex, real-time aggregations. This will highlight the versatility and usability of CrateDB in managing, analyzing, and extracting insights from time series data. 

First, let’s create the table. To store the weather data, create a table with columns tailored to the dataset using the CREATE TABLE command. Although we are using only a few variables here, CrateDB can easily accommodate hundreds of columns in a single table.

CREATE TABLE "weather_data" (
     "timestamp" TIMESTAMP,
     "location" VARCHAR,
     "temperature" DOUBLE,
     "humidity" DOUBLE,
     "wind_speed" DOUBLE CHECK (wind_speed >= 0)
);

 

With the table ready, you are now set to insert the dataset. As the data is available in a publicly readable location, you can easily insert it using the COPY FROM SQL statement.

COPY weather_data
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_weather.csv.gz'
WITH (format='csv', compression='gzip', empty_string_as_null=true);
 

As you can see, the data set only contains seventy thousand records for educational purposes. Real-world data sets are usually much larger and we will demonstrate those in later sessions. Now that the data is successfully imported, let us start to explore it. We start with a basic select statement on all columns and limit the output to display the first 10 records only. 

SELECT *
FROM weather_data
LIMIT 10;
 

You see the first values with all previously created measurements. 

CrateDB is built for fast aggregations using columnar storage, to speed up aggregations across large datasets and long periods of time. For example, we want to calculate the average temperature for each location by using the average aggregation function.  

SELECT location, AVG(temperature) AS avg_temp
FROM weather_data
GROUP BY location;
 
Computing basic averages is not special, but what if you need to answer more detailed questions? For example, if you want to know the highest temperature for each place and when it occurred. 
 

Simple groupings might not be enough and CrateDB has enhanced tools for time series data. You can use the max_by function, which gives you a value, like the time, when another value, like the temperature, is at its highest.   

SELECT location,
    max(temperature) AS highest_temp,
    max_by(timestamp, temperature) AS time_of_highest_temp
FROM weather_data
GROUP BY location;
 

Let’s look at the example query. We now see when the peak in temperature occurred for every location together with the respective maximum temperature. 

You have probably observed by now, that there are gaps in the dataset for certain metrics. Such occurrences are common, perhaps due to a sensor malfunction or disconnection. To address this, the missing values need to be filled in. While there are many ways how to deal with missing values, we want to interpolate based on the previous and the following non-empty value. You can use another statement of CrateDB’s SQL, window functions, paired with the ignore nulls feature. Within a Common Table Expression, we utilize window functions to spot the next and prior non-null temperature recordings, and then compute the arithmetic mean to bridge the gap. 

WITH OrderedData AS (
     SELECT timestamp,
         location,
         temperature,
         LAG(temperature, 1) IGNORE NULLS OVER (ORDER BY timestamp) AS prev_temp,
         LEAD(temperature, 1) IGNORE NULLS OVER (ORDER BY timestamp) AS next_temp
     FROM weather_data
)

SELECT timestamp,
    location,
    temperature,
    (prev_temp + next_temp) / 2 AS interpolated_temperature
FROM OrderedData
ORDER BY location, timestamp
LIMIT 50;
 

This query retrieves the missing temperature values and interpolates them by averaging the previous and next available temperature readings. 

Let us summarize how CrateDB differentiates for time series data. 

It is a SQL database designed for large amounts of data. It offers built-in time-series functionality, including window functions and aggregation functions, which simplify querying and analysis. Functions such as lead, lag, and date bin, as well as joins, allow for advanced data handling and will be discussed in detail in the following sessions. 

Beyond time series data, CrateDB supports various data types, including structured, unstructured, JSON, geospatial, BLOB, and vector data. Its columnar storage format is efficient for time series data, enabling complex ad-hoc aggregations within milliseconds across millions and billions of records. The robust time-based partitioning and sharding enables high cardinality without the need for data pre-aggregation or down-sampling. 

CrateDB's distributed architecture enables high availability and horizontal scaling across multiple nodes, ideal for handling large volumes of data from diverse sources. It also supports data tiering, allowing for cost-efficient long-term data storage. Integration with popular data visualization tools and compatibility with Java and Python ecosystems facilitate efficient data analysis and visualization. As an open-source solution, CrateDB offers cost-effective licensing and an active community for support.