CrateDB Fundamentals

Menu

Document Data

CrateDB Academy: CrateDB Fundamentals

 

In this video, you'll see how CrateDB's container data types, OBJECT, and ARRAY make CrateDB a powerful database for storing and retrieving document data. We'll look at the different ways that an object schema can be specified in CrateDB, touch on how CrateDB indexes objects, and try out some example queries.

JSON documents are everywhere. It's a common way of representing data, with many programming languages supporting the format. Sometimes these documents have a fixed schema, but it's not unusual for documents describing the same sort of data to differ in their representation of it. Here we have two JSON documents describing community areas in Chicago. The document on the right has additional notableresident and zipcodes components. notableresident is an OBJECT, zipcodes is an ARRAY.

Sometimes we want to create and enforce a strict schema for these documents, rejecting those that don't conform to it. For example, here we might want to reject the document for Norwood Park as it contains extra data we haven't planned for. In other cases, we want the flexibility to store and query such semi or unstructured documents in a single collection. How do we store and query these semi and unstructured data types in a relational type table? CrateDB has two container data types that enable us to model this sort of data in a flexible way. We'll begin by exploring the OBJECT type, then briefly touch on the second container type, the ARRAY.

Let's explore the different ways in which we can model objects with CrateDB using our community_areas data set. Here I'm creating a table for community areas using the areanumber as an INTEGER PRIMARY KEY and a TEXT field to store the area name. For everything else, I'm going to use an OBJECT named details. Inside that object, I've described a schema containing a TEXT field for description and a BIGINT field for population. Note that I've used the STRICT object policy, which means that CrateDB will enforce this schema and reject records that don't match it.

We'll go ahead and create the table, and now I'll paste in an INSERT statement. What we see here is that we're inserting a new record into the table, for area 35. It's called Douglas. It has a description and a population inside its details object. And you can see that that object is expressed as a JSON literal here in the insert statement. When I execute this statement, you'll see that again we get result OK, this object and this record were inserted into the table.

If we try again with a slightly different version of this where here I've added a notableresident sub-object that's not in our table schema. What we should see when I click execute is that we get a SQL error because the column notableresident isn't allowed inside the object details. This is where CrateDB has enforced that schema and has rejected this insert.

Let's create an alternative version of our community_areas table. This time I'm using the DYNAMIC object policy for the details object. This is the default in CrateDB. What this means is that when CrateDB encounters fields that are in the details object that are not in the schema, it will add them anyway and infer the appropriate data type. Let's create the table and now let's paste in a community area. In this version of the Douglas community area, I have that extra notableresident object that was not mentioned in the schema. When we run this query, it will run and execute without error. So let's see what the state of the table looks like now.

If we run the SHOW CREATE TABLE community_areas command, you'll see that there is a schema down here and this schema now contains notableresident object with field name and description that have been inferred as TEXT. In this way, CrateDB has accepted a dynamic object, modified the schema accordingly and will index all of these values for us. The final policy to look at is ignored.

Here I'm creating a third version of the community_areas table, and my details object is specified as OBJECT IGNORED. This means that CrateDB will accept fields that are not specified in the schema, but it won't index these. This means that we can store objects that have different shapes inside this object, and we'll see what that means shortly.

Let's first create the table and paste in an example record here for the community_areas, Douglas, we have an object containing notableresident, which wasn't part of the schema. And note that that itself has some structure inside there. There's an address sub-object, so let's create that and that worked.

Now what we're going to do is add a second record that has a slightly different structure from the 1st. I'm going to add a record for a neighbourhood called Norwood Park, and here it is. In this case we also have this notableresident object that wasn't part of our object schema, but now inside that the address is a string. It's no longer another object like it was in the previous record. We execute this, CrateDB stores that, and what's happened here is that the notableresident sub-object and anything inside it has been stored and can be queried, but hasn't been indexed. This means that when we run a query such as this one, you can see that we get back different versions of the address because each record that we inserted had a different data type there. In one case it was a string, in the other case it's an object.

The other container data type in CrateDB is an ARRAY. This is a container for other data types, including objects. Here I'm creating another version of the community_areas table, and inside my details object I've chosen to model an array of zip codes as TEXT. Let's create the table and let's look at an example record.

Here I'm inserting a record for Norwood Park which now has an array of zip codes containing 3 text strings. Let's create that. Now how do we query these? If we take a look at this SELECT query here, what we see is that there is a Python like syntax for slicing arrays. So here I'm asking CrateDB to select the first two entries in the zipcodes array from the details object in the field zipcodes where the area number is 10, which is Norwood Park. Let's go ahead and run that and you'll see that we get an array response containing those first 2 zip codes. Now you've seen how CrateDB can store and query JSON data using the OBJECT and ARRAY data types. As the next step, I'd encourage you to experiment with the different object policies and try queries against the sample data set in your CrateDB cluster.