Using built-in embeddings

Better Stack Warehouse allows you to generate text embeddings without calling any third-party API.

Go to Warehouse -> Sources -> Your source -> Embeddings, then choose which text field to generate embeddings from and which JSON attribute to store them in.

Generating embeddings in Better Stack Warehouse

After setting this up, we'll calculate the embedding for all new events you send into the source 🚀

Can't see your events immediately?

Embeddings can take some time to process, especially for large sets of data.

For example, if you send us 10 GB of events at roughly 1000 tokens each, computing all the embeddings can take a long time. Your data will appear as soon as its embeddings are calculated.
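To get a feel for the scale of that example, here is a rough back-of-the-envelope sketch. The figure of 4 bytes per token and the throughput of 100,000 tokens per second are assumptions made up for illustration; they are not Better Stack's numbers, and the function names are hypothetical.

```python
def embedding_workload(total_bytes: int, tokens_per_event: int, bytes_per_token: int) -> tuple[int, int]:
    """Return (event_count, total_tokens) for a batch of equally sized events."""
    bytes_per_event = tokens_per_event * bytes_per_token
    event_count = total_bytes // bytes_per_event
    return event_count, event_count * tokens_per_event


def processing_hours(total_tokens: int, tokens_per_second: float) -> float:
    """Hours needed to embed total_tokens at a given (assumed) throughput."""
    return total_tokens / tokens_per_second / 3600


# 10 GB of events, 1000 tokens each, ASSUMING about 4 bytes of text per token.
events, tokens = embedding_workload(10 * 10**9, 1000, 4)

# ASSUMED throughput, purely illustrative.
hours = processing_hours(tokens, tokens_per_second=100_000)

print(events, tokens, round(hours, 1))
```

Under these assumptions the batch is about 2.5 million events and 2.5 billion tokens, which is why the embeddings cannot show up instantly.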

Storing embeddings as a time series

Go to Warehouse -> Sources -> Your source -> Time series on NVMe SSDs and click + Time series.

Use JSON dot notation to enter the name of the target attribute from the previous step, and choose BFloat16 as the type.
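BFloat16 keeps the sign and 8-bit exponent of a 32-bit float but only 7 mantissa bits, so each value takes 2 bytes instead of 4. The sketch below illustrates the precision and size trade-off by truncating a float32 to its upper 16 bits. Truncation is a simplifying assumption (real converters often round instead), and the dimension of 1024 is a hypothetical example, not a value from this page.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round-trip x through bfloat16 by keeping the top 16 bits of its float32 form."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    truncated = bits & 0xFFFF0000  # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", truncated))[0]


def vector_bytes(dimension: int, bytes_per_value: int = 2) -> int:
    """Storage for one embedding; 2 bytes per value for bfloat16, 4 for float32."""
    return dimension * bytes_per_value


print(to_bfloat16(1.0))       # exactly representable, unchanged
print(to_bfloat16(0.1))       # slightly off after truncation
print(vector_bytes(1024))     # hypothetical 1024-dimensional embedding as bfloat16
print(vector_bytes(1024, 4))  # the same embedding as float32
```

The small loss of precision rarely changes similarity rankings much, while the stored vector is half the size.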

You'll also need a time series for the data you want to find via the embedding. The easiest way is to create a non-aggregate time series for the text field as well. To minimize the data volume stored on NVMe, create a time series containing only an ID instead, and use that ID to retrieve the full JSON event afterwards.
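The ID-only approach amounts to a two-step lookup: rank a compact index of IDs and vectors, then fetch the full events by ID. Here is a minimal in-memory sketch of that pattern. The IDs, vectors, event contents, and helper names are all hypothetical, and plain dictionaries stand in for the time series and the event store.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Compact index: only an ID and its embedding (stands in for the time series).
index = {
    "evt-1": [1.0, 0.0, 0.0],
    "evt-2": [0.0, 1.0, 0.0],
    "evt-3": [0.9, 0.1, 0.0],
}

# Full events, stored separately and fetched by ID (stands in for the JSON events).
events = {
    "evt-1": {"text_id": "evt-1", "text": "disk usage above 90%"},
    "evt-2": {"text_id": "evt-2", "text": "user logged in"},
    "evt-3": {"text_id": "evt-3", "text": "disk almost full"},
}


def nearest_ids(query, k):
    """Step 1: rank IDs by distance to the query vector, closest first."""
    ranked = sorted(index, key=lambda event_id: cosine_distance(index[event_id], query))
    return ranked[:k]


def fetch_events(ids):
    """Step 2: load the full events for the selected IDs only."""
    return [events[event_id] for event_id in ids]


top = nearest_ids([1.0, 0.0, 0.0], k=2)
results = fetch_events(top)
print(top, [event["text"] for event in results])
```

Only the small index is scanned during the similarity search; the larger event payloads are read just for the handful of matching IDs.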

Creating a time series for your embeddings

Indexing vector columns

If you're working with hundreds of millions of events, your queries might benefit from vector indices on your embeddings. Make sure the Index dimension you select matches the dimension of your embeddings.

Creating a vector index for embeddings

Querying embeddings in Warehouse

Better Stack Warehouse stores embeddings as vector columns that can be indexed and queried efficiently using ClickHouse’s vector type and distance functions such as cosineDistance() or L2Distance().
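The queries in this section rank results with cosineDistance(). Assuming the usual definition, which is also how ClickHouse documents the function, the distance between two vectors a and b is one minus their cosine similarity:

```latex
\[
\operatorname{cosineDistance}(a, b)
  = 1 - \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^{2}}\,\sqrt{\sum_{i} b_i^{2}}}
\]
```

The value is 0 for vectors pointing in the same direction and grows up to 2 for opposite ones, so sorting by distance in ascending order puts the most similar events first.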

Querying in time series

SELECT
  text_id, -- fetch raw texts from JSON events to minimize data on NVMe
  cosineDistance(
    meta_embeddings_text,
    embedding({{description}})
  ) AS distance
FROM {{source}}
ORDER BY distance ASC
LIMIT 5

Querying in events

SELECT
  JSONExtractString(raw, 'text'),
  cosineDistance(
    JSONExtract(raw, 'meta', 'embeddings', 'text', 'Array(BFloat16)'),
    embedding({{description}})
  ) AS distance
FROM {{source}}
-- to improve performance, use WHERE filtering
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
ORDER BY distance ASC
LIMIT 5

Both queries return the five events most semantically similar to the given description.

Curious about the embedding() function?

embedding() is not a real function in the underlying ClickHouse database. It only accepts constant string values, such as a hardcoded string or a query variable.

If you try to use it dynamically on your data, e.g. via SELECT embedding(text), you'll get an error about a missing function:

Code: 46. DB::Exception: Unknown function embedding: While...
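This page does not describe how embedding() is implemented. One plausible mental model, offered purely as an illustration and not as Better Stack's actual mechanism, is a preprocessing step that replaces calls on string literals with a vector before the query reaches ClickHouse. In the sketch below, fake_embed and expand_embedding_calls are hypothetical names, and the "embedding" is a toy two-number vector.

```python
import re


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding model (toy two-number vector)."""
    return [float(len(text)), float(text.count(" "))]


# Matches embedding('...') only when the argument is a single-quoted string literal.
LITERAL_CALL = re.compile(r"embedding\('([^']*)'\)")


def expand_embedding_calls(sql: str) -> str:
    """Replace embedding('literal') with an array literal; leave other calls untouched."""
    def replace(match):
        vector = fake_embed(match.group(1))
        return "[" + ", ".join(str(value) for value in vector) + "]"
    return LITERAL_CALL.sub(replace, sql)


literal_query = expand_embedding_calls(
    "SELECT cosineDistance(v, embedding('disk full')) FROM t"
)
column_query = expand_embedding_calls("SELECT embedding(text) FROM t")

print(literal_query)  # the call is gone, replaced by a vector literal
print(column_query)   # unchanged, so the database would see an unknown function
```

Under this model, a call on a column is never rewritten, which would explain why the database reports the function as unknown.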