CrateDB Fundamentals

Menu

Streaming & Foreign Data Wrappers

CrateDB Academy: CrateDB Fundamentals

 

It's time to look at how CrateDB can be used to provide insights into fast moving continuous streams of data. I'll then also show you how to use data stored in other databases in your CrateDB queries with foreign data wrappers. Let's begin with streaming.

A data stream is a continuous flow. There's no defined beginning or end to it. Data arrives constantly, usually as the result of events happening in a system. Some examples being a user clicking on a website, a new entry being created in an application log, or perhaps a sense of providing an updated reading. Streaming data is often time sensitive in nature because data is continuously arriving. Insights from the stream are best obtained using aggregations or queries over a rolling time window.  CrateDB's ability to ingest and index data at high speed, along with its flexible schemas make it ideal as a database for working with streaming data. Specialised software is often used to manage data streams.

Let's take a look at an example with one of the most popular open source streaming packages. Apache Kafka is an extremely popular open source event streaming platform. Its Connect interface integrates with hundreds of event sources which create events and syncs where event data is ultimately stored for long term analysis. Events are received by Kafka and published onto different message topics. Consumers subscribe to these topics, receive message events and perform processing on them. This processing may be performed by other specialised frameworks. For example, here Apache Flink.

Event data can be inserted into CrateDB using the Kafka Connect JDBC driver via Flink or other logic in your consumer code. CrateDB is excellent in certain indexing performance means that this time critical data is available to query in near real time. This is critical to gaining insights and making timely decisions when working with this sort of system. From here you can use the power of CrateDB's scalable SQL engine to aggregate and analyse the data, combine it with metadata or data from other sources. Dashboards and alerting mechanisms can be built using standard tools and these can provide a continuous near real time feedback loop enabling adjustments to be made that affect or influence the incoming event streams.

Foreign Data Wrappers are a way of handling access to remote data from SQL databases. They are part of the SQL Management of External Data standard. They allow you to write queries that combine data stored in CrateDB with data in external databases without the need for ETL processors. CrateDB's foreign data wrapper implementation uses JDBC and allows you to connect to remote Postgres databases. When querying data in a foreign database, clauses such as GROUP BY and LIMIT are executed in CrateDB. Let's see how this works in practice.

Here I've set up a Postgres instance on Neon's platform online. I've created a new table called taxi_drivers and populated it with the sample data you see on the screen. I've also created a user named crate to access this data. Let's connect to this database from CrateDB and query it using the foreign data wrapper.

Switching over to CrateDB. The first step in configuring a foreign data wrapper is to create a server definition. Here I'm providing the connection URL for the foreign Postgres instance and naming it pg_server using the CREATE SERVER command. The next step is to map a user in CrateDB to a user in the foreign database. CREATE USER MAPPING tells CrateDB that for the server pg_server the user crate has the given password.

Now we have to create a foreign table definition in CrateDB. The schema of this table must match the schema of the table we want to query in the remote instance. And we also tell CrateDB which server to find this table on and the name of the schema containing it there. We're ready to start querying data in the remote Postgres instance using the foreign data wrapper. This straightforward SELECT query does just that, and here are the results of executing it. Remember, the data here resides in the remote Postgres instance. It hasn't been duplicated into a table in CrateDB.

The foreign data wrapper is great for aggregating data from other databases into CrateDB for analytics and reporting. You can of course join data in foreign tables with data in tables that reside in CrateDB performing cross database joints, and you can access foreign data without the time and storage requirements associated with ETL processes. Finally, you could use the foreign data wrapper as part of a migration from another database into CrateDB.

You should now have a high level understanding of how streaming data can be brought into CrateDB and how data residing in other databases can be used in queries without duplication. To learn more about these topics, consult the resources associated with this video and the online CrateDB documentation.