CrateDB Fundamentals

Menu

Vector Similarity Search

CrateDB Academy: CrateDB Fundamentals

 

CrateDB can be used as a vector database to store, retrieve and search data represented as vector embeddings. As CrateDB is a multi-model database, these embeddings can be seamlessly integrated into your existing data sets. This approach makes CrateDB a powerful foundation for AI and machine learning systems. Recommendation engines, intelligent chat bots, and anomaly detection through machine learning are just some of the possible applications here.

In this video, I'll show you how CrateDB stores vector data and how to perform vector searches with approximate nearest neighbour algorithms using SQL. We'll start with a look at how vector data is represented in CrateDB.

So what are vectors and how do we store them in CrateDB? Vectors are numerical representations of data used to quantify their features and characteristics. Vectors transform complex data such as text, images, and audio into a format that allows them to be compared with each other mathematically. They capture the semantic meaning of the original data.

CrateDB offers native support for vectors with the FLOAT_VECTOR data type. Float vector is a one-dimensional array like structure. Each element is a floating point number representing one or more attributes or characteristics of the original data. The array has a fixed length specified in your table schema. CrateDB supports a maximum length of 2048, allowing the output of major embedding algorithms such as those provided by Hugging Face and OpenAI to be stored.

Here's a version of our Chicago Community Areas table that adds a new column for storing vectors. The description_vec column is where we'll store a vector representation of the text contained in the description field of the details object column. Note that with CrateDB the, vector data can live right inside the table along with the rest of the data about each community area We could also have put it inside the details object if we'd chosen to in. In this single table we now have numerical data, geospatial, text with a full text search index defined, and vector data.

Here's an example of what the description text for a community area looks like. So how do we convert this text into the format required for the FLOAT_VECTOR column? For that we need to generate embeddings using a model. Using a small Python program and Open AI's text embedding 3 large model, I generated vectors that each have a dimension of 2048. Of course, you can use other models to generate the embeddings. Each community area's description was read from CrateDB, then sent to OpenAI's API, which uses a trained model to generate the vector representation. The result of this process was stored in the description_vec column using a standard SQL update.

Here we see the original text and the vector embeddings for the community area O'Hare. Note that we retrieved these values using a standard SQL SELECT query.

Now we've got vector data in our table. How do we query it? Suppose you're on the lookout for somewhere to live in Chicago, but you don't know the city all that well. You visited a couple of neighbourhoods and enjoyed your time in Hyde Park. You're wondering which neighbourhoods are like Hyde Park, so you can check them out too. This query helps to answer that question using the KNN match function. KNN match implements the nearest neighbour search, finding data points most similar to a given one.

The first parameter is the name of a column containing vectors to search. Here we're searching the description_vec column. The 2nd parameter is the query vector. We provide a vector representation of the data that we want to find nearest neighbours for. Here I'm using a sub-select to retrieve the vector representation of Hyde Park's description. And the final parameter is the number of nearest neighbours to search in the index. As Hyde Park will be a strong match for its own description, and we already know we like it there, I've excluded it from results like this. Finally, this clause ensures that matches are returned with the most relevant first, and that we get up to five results. The _score column gives us a relative indication of how strong a match we have. Let's go ahead and run the query.

It looks like we should go and look at some of these other community areas that have similar descriptions to Hyde Park. Don't forget that CrateDB is a multi-model database where different types of data can all be accessed using SQL. This means we can perform multi-model queries with ease. What if we wanted a more nuanced search using more than just the vector data?

This query finds community areas like Hyde Park, but which also have land in the area shown on the map. The INTERSECTS clause here uses geo data in the boundaries column of the community_areas table to filter the results there. Let's run the query and this time we only get 4 results. Rogers Park no longer meets our criteria because it's outside of our geo search area as it's located much further north than where we want to find somewhere to live.

In this short video, I was only able to scratch the surface of what's possible with vector data. With CrateDB's multi-model approach, you're not limited to just vector searches. You can combine these with full-text search, time series and relational data for true hybrid searching. For more information, please check out the resources associated with this video and the CrateDB online documentation.