Indexing creates data structures that improve the speed of data retrieval operations in a database. In combination with sharding and partitioning, proper indexing allows for efficient data organization and fast data retrieval, crucial for effective time series data analysis.
To enable efficient aggregation, grouping and search over billions of rows, CrateDB combines multiple approaches: By default, all attributes are indexed based on their respective data types. This releases a lot of burden from the operations team to figure out an optimal indexing strategy that might need to be changed over time to ensure a high query performance for ad-hoc queries, that means queries that might not have been anticipated when designing the indexing strategy.
Memory-mapped columnar storage allows for fast aggregations over huge amounts of data due to better leverage of existing hardware and automatic optimization for data that is accessed very frequently.
Numeric values leverage blocked kd trees, which is an IO-efficient dynamic search tree. Textual data leverages an inverted index and full-text indexes can be created on-demand.
Vectors leverage an Hierarchical Navigable Small World index for efficient approximate nearest neighbour search using an Euclidian distance as default similarity measure. We will dig into more details on the following slides.
CrateDB uses columnar storage, which stores data table columns separately, as opposed to row-based storage which stores entire rows together. This arrangement offers several advantages for time series data:
In our example, the columnar store maps the value of record IDs to the temperature recorded at each weather station. With doc values, when a query requires aggregation, such as calculating the average temperature, CrateDB will directly access the ‘temperature' column's values. This direct access path significantly speeds up query execution. This approach is highly beneficial for time-series data, where datasets are often vast, and queries frequently involve aggregations over time intervals.
The combination of fast range scans as we will outline on the next slide, makes this approach very fast, even when combined with additional filter criteria.
CrateDB employs so-called Block KD-Trees, which are a special version of binary search trees, to enhance the indexing of numerical values, such as timestamps, IP addresses, geospatial data, and of course measurement values.
The way how the data structure is designed ensures high query performance. By dividing the search space in half at each step, they offer logarithmic search time complexity especially for filtering and range-queries which are very common in time series analytics, for example when selecting a certain range of timestamps, slicing by device IDs, or filtering for certain measurement values. When dealing with geospatial data, which is inherently multi-dimensional, the importance of Block KD-Trees becomes even more pronounced to handle complex queries across longitude and latitude as well as different levels of geospatial granularity.
In addition, BKD-Trees are especially good at indexing data that is stored on disks. For range queries and partial match queries, BKD-Trees help in quickly identifying the relevant disk blocks without scanning unrelated data. Furthermore, they partition data in such a way that related data points are stored close to each other on the disk. This spatial locality minimizes the number of disk pages that need to be read, as more relevant data is loaded with fewer I/O operations.
BKD-Trees remain efficient for high-volume and bulk loading operations, which is critical for time-series data where large volumes of data are inserted and kept in the database, which means the index constantly grows over time.
In CrateDB, an inverted index is a data structure that maps the content to its location in a table, or a document.
The plain index is used for exact matching. It's straightforward and fast for queries that know the exact ‘phrases' they are looking for. In our case, we have an index that maps each weather station to an 'ID', such as ‘Berlin West' to ID 1, ‘Berlin South' to ID 2, and ‘Zurich South' to ID 3. This type of index is ideal for direct and unambiguous lookups.
On the other hand, the full-text index supports more complex search operations, allowing for searches within the text for partial matches and variations. In the full-text index example, the weather station column has been indexed to identify the occurrence of terms across different stations. For instance, the term ‘Berlin' is associated with IDs 1 and 2, which correspond to ‘Berlin West' and ‘Berlin South', respectively. Similarly, ‘South' is linked to IDs 2 and 3, ‘West' to ID 2, and ‘Zurich’ to ID 3. This enables users to search for a term like ‘Berlin' and find all related stations in the city of Berlin, regardless of their exact names.
Full-text search, which goes beyond simple pattern matching, is possible by creating full-text indices, which can also be composite index, allowing the indexing of multiple text columns together.
The example demonstrates how to create a table with a composite full-text index on the 'name' and 'description' columns, and then use that index to perform a search. Using custom analyzers, we can refine the text tokenization process to different languages or specialized requirements. In our example, we’ve used a standard analyzer tailored to the English language. Following the table's creation, we perform a full-text search, where the term ‘key station' is used to query the stations table – we want to identify important weather stations for which we can find a text like ‘key station’ in the name or description field. The output is sorted by a relevance score, which assesses the degree to which each entry matches the search term. Notably, 'Berlin West' ranks higher than 'Berlin South', indicating a stronger association with the phrase 'key station' as per the full-text search's analysis. We can see the difference in the description attribute, both of them contains the term ‘station’, but only Berlin West is called a ‘key weather station’.
CrateDB uses standard SQL making it easier to integrate full-text search with other types of filters and query criteria and it can execute a variety of full-text searches. Fuzzy searches allow for a margin of error, accommodating slight misspellings or variations in the search terms. Phrase searches target exact sequences of words, and attribute boosting prioritizes certain text over others, enhancing the relevance of search results.
With the presented set of indexes, CrateDB can efficiently support advanced time series use cases: from filtering and slicing the data to advanced full-text and geospatial searches.