Advanced - Time Series


Lesson 3: Real-Life Challenges with Time Series


 

Video Transcript

In this section, we will explore the real-life challenges of time series data: Volume, Velocity, Variety, and Veracity, and discuss how CrateDB effectively addresses these challenges. 

Let's discuss the inherent four V’s of time series data and how we navigate through these challenges: 

Volume: 

Time series data is known for the vast amount of data it generates. For tasks like root cause analysis or precise forecasting, it's crucial to maintain granular data. Downsampling isn't an option here, as it would result in the loss of important details. 
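To make the point about downsampling concrete, here is a minimal sketch in plain Python. The readings, the spike, and the helper name `downsample_to_minutes` are all made up for illustration; they are not part of CrateDB or of the course material. It averages per-second values into per-minute values and compares the peak before and after.

```python
from statistics import mean

# Hypothetical sensor readings: one value per second for two minutes.
# A short spike at seconds 70..72 is the kind of event a root cause
# analysis would be looking for.
readings = [(second, 20.0) for second in range(120)]
for second in (70, 71, 72):
    readings[second] = (second, 95.0)


def downsample_to_minutes(samples):
    """Average per-second samples into one value per minute."""
    buckets = {}
    for second, value in samples:
        buckets.setdefault(second // 60, []).append(value)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}


per_minute = downsample_to_minutes(readings)
raw_peak = max(value for _, value in readings)
downsampled_peak = max(per_minute.values())

# The raw data still contains the spike; the per-minute average smooths it away.
print("raw peak:", raw_peak)
print("per-minute averages:", per_minute)
print("downsampled peak:", downsampled_peak)
```

In this made-up series, the three-second spike is clearly visible in the raw data but is mostly averaged away in the per-minute values, which is why granular data matters for this kind of analysis.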

Velocity: 

The speed at which data is generated and processed is another significant factor. High-performance queries and aggregations are a must, and we need to be able to act on incoming data in real time. 

Variety: 

Time series data comes in various types and formats, ranging from structured to semi-structured and unstructured data. This variety needs to be effectively handled by the technology we use. 

Veracity: 

Finally, the quality and reliability of the data are critical. Given the high volume and velocity of time series data, ensuring data veracity can be a significant challenge. 

How do these four V’s manifest in a data architecture? 

Depending on the use case of your project, various data sources might be relevant. These can range from enterprise data in ERP or CRM systems and existing analytical data to sensor data, APIs, data lakes, and other databases. The applications to be built can vary from traditional applications that run in a browser or on mobile phones to real-time analytics, visualizations, predictions, and natural language applications like chatbots. Operational tasks that support these efforts, like MLOps, also come into play. 

A project typically starts with a time series database that imports data from sensors or streams. New applications can be built in a timely manner, often with a small team and within budget. As the application grows, you start working with contextual data imported from enterprise systems like ERP or CRM. Perhaps existing analytical data should be utilized as well. At this point, a time series database alone no longer fulfils the requirements, leading to the introduction of a relational database. This new technology needs to be integrated and operated properly. 

Applications now also require a backend to communicate with different data stores as they cannot speak five or six different languages to access each individual storage technology. This application backend could be code or a more complex solution like data virtualization or federated query layers. Alternatively, or in addition, you might start integrating a document database for easier and faster schema management for more diverse contextual data. This addition to the tech stack introduces a new language that developers need to learn. As the application gains more users, the demand for data search functionality grows, prompting the addition of a search engine to the tech stack and yet another language that developers need to learn. 

The number of data integration and synchronization processes also increases. Taking this scenario further, a geo-database and a new query language may be introduced for specialized tracking purposes. Finally, as users request a chatbot interface, a vector database might be introduced to store the necessary semantic representations of your data. These technologies also come with their own proprietary APIs. 
 
In the end, you have a very complex architecture with a lot of data replication, many different technologies, and a different language for each of them. Usually, this reality is hidden behind a well-drawn architecture picture, but it remains a spaghetti architecture. The effort in terms of people, time, and money has grown significantly at this stage, demonstrating the complexity and challenges of managing all the data that is necessary to build an end-to-end time series solution. 

In summary, as time series data management projects evolve, they often encounter growing complexity and mounting technical debt. This debt, once established, becomes increasingly hard to eliminate. Managing diverse data types in separate data stores necessitates data integration and synchronization, which add to the project's complexity.

Furthermore, scaling multiple data stores up and out can prove challenging, as traditional time series databases often falter under such demands. The result is an intricate application backend, caused by the different languages and silos, which can be quite challenging to manage effectively. 
 
These issues have significant implications. On the people front, the complexity necessitates multiple skill sets and a larger team to maintain the system effectively. This situation is far from the ideal scenario of simple and fast training, high developer productivity, and efficient operations. In terms of time, these challenges slow down the delivery of value, prolong the time required for changes, and lead to a lot of overhead activities. This is a stark contrast to the desired state of fast development and rapid changes. Moreover, these issues result in high total cost of ownership and high cost of change, both of which businesses would prefer to minimize. 
 
The solution to these challenges lies in maintaining a single source of truth that is kept updated in near real time. This involves native support for multiple data types in one technology, accessible via standard SQL. A dynamic schema supports rapid application development and evolution, while high performance, availability, and scalability ensure that the system can handle the demands placed upon it. In this way, businesses can effectively manage time series data, reducing complexity, minimizing total cost of ownership, and accelerating time to value. 

This is exactly how CrateDB can help. Beyond its capability of storing tabular and time series data, CrateDB provides several additional features that enhance its utility. These include support for JSON, full-text search, vector storage and similarity search, as well as geospatial data handling. It can also store binary data. All of these data types can be combined in a single database record if needed and be easily accessed via standard SQL, making CrateDB a robust and versatile solution for time series data management. 
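As a rough illustration of what "one store, one query language" means for the application, here is a minimal sketch. It uses Python's built-in `sqlite3` module purely as a stand-in so the example runs anywhere; it is not CrateDB, and the table names, columns, and values are hypothetical. The point is only that time series readings and contextual data can be combined in a single SQL statement, rather than through a backend that talks to several stores in several languages.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in; this is NOT CrateDB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Contextual data, e.g. imported from an enterprise system (hypothetical).
    CREATE TABLE devices (device_id TEXT PRIMARY KEY, site TEXT);
    -- Time series data from sensors (hypothetical).
    CREATE TABLE readings (device_id TEXT, ts INTEGER, temperature REAL);
""")
conn.executemany(
    "INSERT INTO devices VALUES (?, ?)",
    [("d1", "Vienna"), ("d2", "Berlin")],
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("d1", 1, 20.0), ("d1", 2, 22.0), ("d2", 1, 30.0), ("d2", 2, 34.0)],
)

# One SQL statement joins the time series with its context and aggregates it.
rows = conn.execute("""
    SELECT d.site,
           AVG(r.temperature) AS avg_temp,
           MAX(r.temperature) AS max_temp
    FROM readings r
    JOIN devices d ON d.device_id = r.device_id
    GROUP BY d.site
    ORDER BY d.site
""").fetchall()
conn.close()

for site, avg_temp, max_temp in rows:
    print(site, avg_temp, max_temp)
```

With separate stores, the same result would need two queries in two different languages plus application code to merge them.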

The dynamic schema and user-defined functions add the needed flexibility for efficient application development and maintenance when it comes to building new features, changing data schemas of sources, and integrating new data sources. 

The versatility and flexibility are backed by a distributed query engine that supports high volumes of concurrent reads and writes. Columnar storage and advanced indexing techniques support complex queries and aggregations, which eases development, as not all indexes need to be defined manually. The distributed nature allows for highly available and horizontally scalable architectures.

Last but not least, CrateDB can be deployed on the edge, in your data centers, and in cloud environments. It is also available as a fully managed service. A synchronization mechanism called Logical Replication helps to keep the edge and other clusters in sync. For example, you can store and process data on the edge and synchronize the relevant information into the cloud to perform holistic analyses.