CrateDB Fundamentals

Menu

Aggregating and Grouping Data

CrateDB Academy: CrateDB Fundamentals

 

When working with large amounts of data, we often need to simplify or summarise the data set in order to gain insights from it. In SQL, this can be achieved by using a combination of aggregate functions and grouping. CrateDB supports both of these concepts, and we'll explore them in this video using our Chicago taxi rides table. Let's dive in!

Here I've connected to my CrateDB cluster with the Crash CLI. The first thing to do is find out how many records are in our taxi_rides table. This query tells us that there are 245,000 records. These are all of the rides for the month of April 2024. We've actually already started using aggregations here. The COUNT function is a good example of a simple aggregation. It returns a single summary value telling us how many rows are in the table. With this much information available, the value of data contained in a single row is often less important than the insights gained from looking at the summary of some metric applied to all or some of the rows.

Let's dig in further and see how CrateDB can help us understand the data in this table. Here is the schema for our taxi_rides table. The totalcost column contains data about the total dollar cost of each taxi ride made up from the base fare, plus any tips, tolls, and extras. Using aggregate functions we can work out what the average total cost of a taxi ride was, and we can see that the average amount paid was around $25.73. The average aggregation function only considers values that are not null, so we don't need to add an extra clause to our SQL statement to exclude these from the calculation.

Using the MAX aggregation function, we can find the highest value for a given column. This query finds the ride that went the furthest, and that's a pretty long way.

What if we wanted to know the trip ID of the furthest ride so that we could dig into the details some more? Let's find out how. The max_by aggregation function returns the value of 1 column from the row that contains the maximum value of another. This query retrieves the trip ID for that long ride. Using this ID we can ask questions such as which vehicle drove that trip and what was the total cost. Here we see it cost $1579.50 and that the trip was undertaken by vehicle 5413.

As you might expect, we can also use aggregations against a subset of the records in a table. Here we're asking for the average total cost of trips between 10 and 20 miles long. These trips cost, on average $47.40. Aggregate functions produce a single result. We often need to produce richer summaries of the data. For example, what if we wanted to know the average total cost per ride originating in each of the city's community areas? For this, we need to introduce grouping using a GROUP BY clause.

GROUP BY groups rows that have the same values. Aggregation functions can then be applied to these groups to produce a single resulting value for each one.

Here is a query that tells us the average cost per ride, split out by community area. I've chosen to return these in descending order of average cost and to limit my result set to five entries. Community area 76 seems to be the best place for taxi drivers to pick up passengers whose trips have the highest average total fare.

We could also return the community name directly using a single query like this, and here's our combined result. It's not surprising to find out that these trips originate from the O'Hare community area, which is home to Chicago's largest International Airport.

In this video, you learnt how CrateDB helps you make sense of large amounts of data using aggregate functions and grouping. You'll have noticed the impressive performance that CrateDB offers when performing these operations. Each query in this video taking mere milliseconds to complete. This is in part due to the use of columnar storage by default rather than using a row based format. Adopting this approach, in which relevant values are stored together makes aggregations more efficient.

I encourage you to experiment with these concepts using your own CrateDB cluster, the taxi_rides table and by referring to the relevant sections of the CrateDB documentation online.