Advanced - Time Series

Menu

Lesson 4: Time Series Data Modeling

CrateDB Academy: Time Series Video

 

Video Transcript

​We will now demonstrate how CrateDB is far more than a pure time series database, supporting time series, but also JSON, Geospatial, Vectors, and Full-Text Search. There will be a demonstration of standard SQL, and how the different data types can be queried is demonstrated. Examples, mainly based on weather data, underpin these statements, like full-text search across textual attributes or user-generated comments over time. 

Relational and Time Series Data: 

We will start with time series data which follows a simple relational representation in CrateDB. The distinguishing characteristics of time series data include timestamped records and are typically append-only.   

CREATE TABLE IF NOT EXISTS weather_data (
    ts TIMESTAMP,
    location VARCHAR,
    temperature DOUBLE,
    humidity DOUBLE CHECK (humidity >= 0),
    PRIMARY KEY (ts, location)
) WITH (column_policy='dynamic');
 

Consider a table named weather data. In this table, we have columns for timestamp, location, temperature, and humidity. The humidity should also automatically be checked to be non-negative. The primary key is a combination of ts and location. We also want to use a dynamic column policy, that means to automatically add new columns at runtime. 

Now, let's see how we can insert data into this table. The INSERT INTO command is used to insert a record with a timestamp, location, temperature, and humidity values for San Francisco.  

INSERT INTO weather_data (ts, location, temperature, humidity) VALUES ('2024-01-12T14:30:00', 'San Francisco', 17.5, 60.0); 

 

And we can view the data in our table using a SELECT command.   

SELECT * FROM weather_data; 

Now, let's say we want to add a new column wind speed to our table. Because we have set the column policy to dynamic, we can add this new column directly through an insert into command. This new column will be automatically added to our table.  

INSERT INTO weather_data (ts, location, temperature, humidity, wind_speed) VALUES ('2024-01-12T14:31:00', 'San Francisco', 17.5, 60.0, 35.2); 

We can view the updated data in our table using the SELECT command. The new column is automatically indexed, so running a search query will be efficient, even across large data volumes. 

SELECT * FROM weather_data WHERE wind_speed > 30;  

Document/JSON Data:  

CrateDB offers support for JSON. JSON is a commonly used format for data exchange, making it ideal for importing and exporting time series data. It also allows for the storage of arbitrary contextual data, providing additional details that can easily be appended to any time series data set. This added context can often prove helpful when analyzing patterns and trends. The OBJECT data type in CrateDB allows for a strict, dynamic, or ignored schema, depending on the specific use case. This means you can insert JSON directly as a string, and any new attributes will be automatically indexed. This combination of flexibility and power makes CrateDB an excellent choice for managing time series data. 

In this example, we further enhance our weather data table by adding contextual data. We use the ALTER TABLE command to add a new column sensor description of type object dynamic to store JSON data. This allows us to include additional details like quality of measurement and battery life of the sensor. We then insert a record with these new attributes into the table.  

CrateDB's dynamic schema allows us to immediately run a query to find devices with low battery. The SELECT statement identifies distinct locations where the battery life from the sensor description is less than 40.    

ALTER TABLE weather_data  ADD COLUMN sensor_description OBJECT(DYNAMIC); 

INSERT INTO weather_data (ts, location, temperature, humidity, wind_speed, sensor_description) VALUES ('2024-01-12T14:32:00', 'San Francisco', 17.8, 61.0, 33.2, '{"quality":90, "battery_life":35}'); 

SELECT DISTINCT(location), sensor_description['battery_life'] as battery_life FROM weather_data WHERE sensor_description['battery_life'] < 40 limit 100; 

To use JSON data for time series directly, we create a new table called weather_data_json with columns for timestamp, location, and vals which is of type OBJECT.    

CREATE TABLE weather_data_json (
    ts TIMESTAMP,
    location VARCHAR,
    vals OBJECT,
    PRIMARY KEY (ts, location)
);
 

We then insert a record into the table with a timestamp, location, and values for temperature and humidity in JSON format. Using a SELECT command, we can view the data in our table.    

INSERT INTO weather_data_json (ts, location, vals) VALUES ('2024-01-12T14:30:00', 'San Francisco', '{"temperature":17.5, "humidity":60.0}'); 

SELECT * FROM weather_data_json; 

We can also add more attributes to our JSON data on the fly, such as wind speed, by simply including it in the INSERT INTO statement. The updated data, including the new wind speed attribute, can be viewed and filtered by immediately.    

INSERT INTO weather_data_json (ts, location, vals) VALUES ('2024-01-12T14:31:00', 'San Francisco', '{"temperature":17.5, "humidity":60.0, "wind_speed":35.6}'); 

SELECT * FROM weather_data_json; 

Full-Text Search: 

In addition to numerical and categorical data, CrateDB is also highly effective in handling textual data. This could include attributes such as machine status, user comments, or log files, which can be valuable sources of information for analysis. Furthermore, CrateDB is equipped with robust full-text search capabilities. This allows for efficient searching through these textual data, not just for exact matches, but also for typos or specific phrases and even synonyms. With CrateDB, you can efficiently perform a full-text search through your data and extract the insights you need. 

Let us have a look at a table for web server logs. It contains a timestamp, IP address, request method, the called HTTP endpoint, a response status and the user agent.   

CREATE TABLE web_server_logs (
    ts TIMESTAMP,
    ip_address TEXT,
    request_method TEXT,
    endpoint TEXT INDEX USING FULLTEXT,
    response_status INT,
    user_agent TEXT
);
 

We insert a few rows calls to the home, login and products endpoints.   

INSERT INTO web_server_logs (ts, ip_address, request_method, endpoint, response_status, user_agent)
VALUES
('2024-01-12T08:00:00', '192.168.1.1', 'GET', '/home', 200, 'Mozilla/5.0'),
('2024-01-12T08:05:23', '192.168.1.2', 'POST', '/login', 303, 'Mozilla/5.0'),
('2024-01-12T08:10:45', '192.168.1.3', 'GET', '/products', 200, 'Mozilla/5.0');
 

Imagine, users can type search criteria that will be used in the queries. Instead of typing home, two letters are mixed up. The fuzzy search capabilities in CrateDB can easily cope with these misspellings. This query uses a fuzziness of five, which means there can be a Levenshtein distance of five or five letters are allowed to be different in the search term and the actual term in the database.   

SELECT * FROM web_server_logs  WHERE MATCH(endpoint, 'hmoe') USING best_fields WITH (fuzziness=5); 

Of course, full-text search can be combined with other attributes as well. For example, all successful HTTP calls to the home endpoint can be found by adding the condition of response_status equals 200.   

SELECT * FROM web_server_logs
WHERE MATCH(endpoint, 'hmoe')
USING best_fields
WITH (fuzziness=5) AND response_status = 200;
 

There are many more options available, please refer to the CrateDB documentation for all the details on full-text search. 

Vector Search: 

CrateDB supports the float vector data type for columns, which is ideal for storing floating point vectors. This data type can handle up to 2048 dimensions, accommodating a wide array of data complexities. To efficiently index these vectors, CrateDB utilizes a Hierarchical Navigable Small World graph, an efficient index for searching similar vectors in high dimensional spaces. An Approximate k-Nearest Neighbor Search can be done with the k n n match function based on an euclidean distance function. This combination of features enhances CrateDB's ability to handle complex, multi-dimensional data. The KNN match can be combined with other conditions as well. 

Vectors are very prominent nowadays when it comes to semantic search, where we also see a big area of application. In this example, we focus on finding similar patterns as a common problem in time series analysis. Although there are multiple different ways to identify similar patterns, executing a similarity search on a vector is a viable solution. One could create vectors of various features like minimum or maximum values, moving averages etc. In our example we want to find weeks with similar temperature patterns. Therefore, we create a table called weather patterns weekly that stores the location, the particular week and the maximum temperatures of every single day in a vector. 

CREATE TABLE weather_patterns_weekly (location TEXT, week TIMESTAMP, max_temp_pattern FLOAT_VECTOR(7)); 

To populate this table, we aggregate the temperature values of our weather data table in the following way. The common table expression of this statement truncates the date of measurements to weeks and days and calculates the maximum temperature per location, week, and day. The outer select statement groups the data by location and week to aggregate the daily maximum values into a float vector. The resulting data is inserted into the weather patterns weekly table. 

INSERT INTO weather_patterns_weekly
WITH daily_temperatures AS (
     SELECT location, date_trunc('week', timestamp) w, date_trunc('day', timestamp) d, max(temperature) t_max
     FROM weather_data
     GROUP BY location, w, d
     ORDER BY location, w, d
)

SELECT location, w, array_agg(t_max)::FLOAT_VECTOR FROM daily_temperatures
GROUP BY location, w;
 

As we can see, the result contains a vector of daily temperatures per location and week. 

SELECT * FROM weather_patterns_weekly ORDER BY location, week; 

In order to find similar weeks, we execute a K nearest neighbour search for the week of twentieth February 2023 represented by the given timestamp. In order to do so, we select the temperature pattern of the week of interest and execute a similarity search on the temperature pattern and want to find the top 10 most similar weeks. The score value represents the similarity. Please note that this value is not the actual distance of vectors, but an arbitrary value to rank the results. For more details on the score value, please refer to the documentation. 

SELECT *, _score
FROM weather_patterns_weekly
WHERE knn_match(
    max_temp_pattern,
    (SELECT max_temp_pattern FROM weather_patterns_weekly WHERE location='Berlin' and week = 1676851200000),
    10
) ORDER BY _score DESC
LIMIT 10;

 

The results provide us with various weeks that have a similar temperature patterns.  

If we look at a visualization of the similar data sets, we can see that the weekly temperature patterns look similar to each other. Obviously, this example is very simple and for a real comparison of patterns we should do a proper normalization of data first. Still, it shows how the similarity search can be used to detect patterns. 

The vector search can be used for a lot of different cases, especially when creating embeddings of textual data, videos, audio files or other binaries. This is not part of this course, please refer to the CrateDB documentation and other tutorials, for example, about CrateDB’s lang chain integration on how to work with large language models. 

Geospatial Data:  

Geospatial data plays a critical role in time series solutions, offering a spatial context to the time-stamped data. This data type allows us to track changes over time in a particular geographic area, which can be vital for numerous applications. For example, in the transportation industry, geospatial data can help track vehicle routes and monitor traffic conditions over time. In environmental studies, it can be used to monitor changes in weather patterns or wildlife movements. By integrating geospatial data into time series analysis, we can uncover spatial trends and patterns, enhancing our understanding and enabling more informed decision-making. 

For this introduction, let us create a simple table for the cities that are referenced in the weather data. 

CREATE TABLE cities ( name TEXT, country TEXT, location GEO_POINT ); 

We want to insert a few rows into the new table. 

INSERT INTO cities (name, country, location) VALUES
    ('Redwood City', 'United States', [-122.2364, 37.4852]),
    ('Berlin', 'Germany', [13.4050, 52.5200]),
    ('Dornbirn', 'Austria', [9.7433, 47.4125]),
    ('Vienna', 'Austria', [16.3738, 48.2082]),
    ('Zurich', 'Switzerland', [8.5417, 47.3769]);
 

Selecting this data shows that CrateDB interprets the data correctly and, for example, opens the coordinates on OpenStreetMap when clicking on the arrow icon. 

SELECT * FROM cities;   

The geospatial data can now be used, for example, to visualize weather conditions on a map or automatically figure out if, for example, storm events will cross a certain country that is defined via its geo shape. All of it can be done via the available functions to query geo data directly in CrateDB. 

Binaries: 

Binary Large Object or BLOB is another key data type that can be effectively managed in time series solutions. This type of data can range from images and audio files to logs or even serialized machine learning models. Storing and retrieving BLOB data efficiently is crucial in many time series applications, especially when dealing with multimedia content or large log files. CrateDB supports BLOB storage, allowing users to store and retrieve binary data efficiently, making it an excellent choice for managing complex time series data with various data types. 

They are stored in separate tables which allows for different disk configurations than the regular database files. Upload and download can be executed via the HTTP interface of CrateDB. All metadata of binaries is available via SQL. 

As an alternative, binaries can also be stored in text columns using a base64 encoding. 

This closes our overview of how CrateDB handles various datatypes that are relevant for comprehensive time series data management.