# Building Retrieval Augmented Generation (RAG)

This guide demonstrates how to build a Retrieval Augmented Generation (RAG) application using Better Stack Warehouse, all **without needing to generate or store** your own [embeddings](https://betterstack.com/docs/warehouse/vector-embeddings/intro/).

## Overview

Imagine you need to **classify incoming emails** by comparing them against a large repository of existing emails. A great way to do this is to **generate vector embeddings** for your existing emails, so you can run semantic distance searches between those emails and any new ones.

[note]
#### Not familiar with vector embeddings?

Feel free to read through an [introduction to embeddings](https://betterstack.com/docs/warehouse/vector-embeddings/intro/) first 🙌
[/note]

With Better Stack Warehouse:

* **No manual embedding needed.** Let Warehouse [automatically generate embeddings](https://betterstack.com/docs/warehouse/vector-embeddings/built-in-embeddings/) for your data at both insertion and query time.

* **Automatic public APIs with result caching.** Use [saved queries](https://betterstack.com/docs/warehouse/querying-data/queries/) to retrieve search results in your application without writing complex SQL, and benefit from automatic caching.

This guide will take you through setting up your Warehouse config, importing your existing emails into Warehouse, and then using a saved query with similarity search to find the most relevant historical emails based on a new search term or email body.

## Create a source

The source will hold all your data and allow you to query it.

* Go to **Warehouse** -> [Sources](https://warehouse.betterstack.com/team/0/sources ";_blank") -> [Connect source](https://warehouse.betterstack.com/team/t328468/sources/new ";_blank").
* Pick a name and your preferred data region.

![Create a source](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/1c113cb6-838b-400e-1d58-3ecf3050ab00/md2x =3250x1700)

## Create an embedding definition

First, we need to tell Warehouse where to find the text to generate embeddings for:

* Go to **Warehouse** -> [Sources](https://warehouse.betterstack.com/team/0/sources ";_blank") -> Your source -> **Embeddings**.
* **Define an embedding** by specifying `text` as the JSON path to read text from and `text_embed` as the field to write the generated embeddings to.

You can customize model options here. Currently, `embeddinggemma:300m` is offered, which is an excellent model.

For specific model requests, contact hello@betterstack.com.

![Create an embedding definition](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/1307d575-bb77-49eb-517b-d06a751d3000/md1x =3314x2030)

## Configure time series

Next, we'll optimize storage for fast querying of your embeddings.

A vector index will give you **extremely fast similarity comparison**, even with hundreds of millions of rows.

* Go to **Warehouse** -> [Sources](https://warehouse.betterstack.com/team/0/sources ";_blank") -> Your source -> **Time series on NVMe SSD**.
* **Create two time series**, both using **No aggregations**:
  * `text` - a string column to hold the actual email content.
  * `text_embed` - an `Array(Float32)` column to store the generated embeddings.

**Important**: Ensure the dimensions in the automatic vector index for `text_embed` match the output dimensions of the embedding model you defined in the previous step.

![Configure time series](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/67fd588f-4d0f-4394-3429-cfca8b6c3700/public =2648x2174)

## Send your data

[Ingest your data](https://betterstack.com/docs/warehouse/ingesting-data/intro/) into the Warehouse. Each record should have a `text` field containing the email content.  
    
```bash
[label Send emails to Warehouse]
curl -X POST https://$INGESTING_HOST \
   -H "Authorization: Bearer $SOURCE_TOKEN" \
   -H "Content-Type: application/json" \
   -d '[
        { "text": "Hello, this is a test email about product features." },
        { "text": "We are experiencing issues with our new payment gateway." },
        { "text": "Regarding your recent inquiry about account settings and billing." }
      ]'
```  

Warehouse will **automatically generate embeddings for these emails** based on your embedding definition.
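If your emails already live in an application rather than a shell script, the same request can be sketched in Python. This is a minimal example that assumes the same `INGESTING_HOST` and `SOURCE_TOKEN` environment variables as the `curl` command above; the function names `build_ingest_request` and `send_emails` are hypothetical and only used for illustration.

```python
import json
import os
import urllib.request


def build_ingest_request(host: str, token: str, emails: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one record per email."""
    # Each record only needs a `text` field; embeddings are generated by Warehouse.
    payload = [{"text": body} for body in emails]
    return urllib.request.Request(
        url=f"https://{host}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_emails(emails: list[str]) -> int:
    """Send the emails and return the HTTP status code."""
    request = build_ingest_request(
        os.environ["INGESTING_HOST"], os.environ["SOURCE_TOKEN"], emails
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Building the request separately from sending it makes it easy to inspect the payload before anything leaves your machine.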

If you visit **Warehouse** -> [Sources](https://warehouse.betterstack.com/team/0/sources ";_blank") -> Your source -> [Live tail](https://telemetry.betterstack.com/team/0/tail ";_blank"), you should see records containing both the `text` field and a `text_embed` field holding an array of numbers.
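To double-check a record copied from Live tail, a small validation helper can confirm that both fields are present and that the embedding length matches your vector index. This is only a sketch: `check_record` is a hypothetical helper, and the `768` default is a placeholder you should replace with the dimensions configured for `text_embed`.

```python
from numbers import Real

EXPECTED_DIM = 768  # assumption: replace with your vector index dimensions


def check_record(record: dict, expected_dim: int = EXPECTED_DIM) -> list[str]:
    """Return a list of problems found in one record; empty means it looks fine."""
    problems = []
    if not isinstance(record.get("text"), str):
        problems.append("`text` is missing or not a string")
    embedding = record.get("text_embed")
    if not isinstance(embedding, list) or not all(isinstance(x, Real) for x in embedding):
        problems.append("`text_embed` is missing or not an array of numbers")
    elif len(embedding) != expected_dim:
        problems.append(
            f"`text_embed` has {len(embedding)} dimensions, expected {expected_dim}"
        )
    return problems
```

An empty list means the record has the shape this guide expects; a dimension mismatch usually points back to the time series configuration.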

## Create a query for similarity search

[Saved queries](https://betterstack.com/docs/warehouse/querying-data/queries/) let you define a query and **get a publicly accessible URL** that returns its results as JSON, CSV, or TSV to any app or web page, without needing to distribute credentials. As an added bonus, queries are cached automatically, making them **ideal for invoking in a front-end app** that may be loaded many times.

* Go to **Warehouse** -> [Queries as APIs](https://warehouse.betterstack.com/team/t0/queries ";_blank") -> [Create query](https://warehouse.betterstack.com/team/t0/queries/new ";_blank").
* **Pick your source**, and make sure `time series` is selected.
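* **Create a new query using the SQL below.** This query automatically generates an embedding for the provided search term, then uses the ClickHouse `cosineDistance` function to compute the distance between that embedding and each stored `text_embed` vector. A smaller distance means a closer semantic match, so results are ordered by ascending distance.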

```sql
[label Query for similar emails]
SELECT text,
  cosineDistance(text_embed, embedding({{search}})) AS distance
FROM {{source}}
ORDER BY distance ASC LIMIT 100
```        

* **Save the query** to get a unique URL, such as:  `https://<cluster>.betterstackdata.com/query/<your_query_token>.json`
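To build intuition for the `distance` column in the query above: cosine distance is 1 minus the cosine similarity of two vectors. The sketch below reimplements that formula in plain Python on toy 2-dimensional vectors. The vectors and labels are made up for illustration, and this is not how Warehouse computes the value internally.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Zero-length vectors are not handled in this sketch (division by zero).
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
search = [1.0, 0.0]
emails = {
    "billing question": [0.9, 0.1],
    "payment outage": [0.0, 1.0],
    "unrelated newsletter": [-1.0, 0.0],
}

# Mirrors ORDER BY distance ASC: the closest match comes first.
ranked = sorted(emails, key=lambda name: cosine_distance(emails[name], search))
```

Because only the angle between vectors matters, two emails of very different lengths can still end up close together if they are about the same topic.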

You can now **query this URL directly from your application**, or even a front-end app, by appending a `search` query parameter:  
`https://<cluster>.betterstackdata.com/query/<your_query_token>.json?search=my%20search%20text`

The **API will return the 100 most similar records** from your dataset, with results automatically cached for performance 🚀
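From application code, the main thing to get right is URL-encoding the search text. Here is a minimal Python sketch; `build_query_url` and `fetch_similar` are hypothetical helper names, and `saved_query_url` stands for the URL you received when saving the query. The exact shape of the JSON response is not covered in this guide, so the sketch returns the parsed body unchanged.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(saved_query_url: str, search: str) -> str:
    """Append a URL-encoded `search` parameter to the saved query URL."""
    # quote_via=quote encodes spaces as %20 rather than "+".
    params = urllib.parse.urlencode({"search": search}, quote_via=urllib.parse.quote)
    return f"{saved_query_url}?{params}"


def fetch_similar(saved_query_url: str, search: str):
    """Call the saved query and return the parsed JSON body as-is."""
    url = build_query_url(saved_query_url, search)
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

Encoding matters most when the search text is a full email body, which will often contain characters such as `&`, `=`, or line breaks.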

## Wrap-up

By following these steps, you can **set up a powerful RAG application** without the overhead of managing embedding infrastructure or hardware.

Better Stack handles data ingestion, storage, embedding generation, and efficient querying, allowing you to **focus on building your core application logic**.
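As one example of that application logic, the retrieved emails can be folded into a prompt for whichever language model you use. The sketch below assumes you already have the result rows as a list of dictionaries with the `text` and `distance` columns from the saved query; `build_prompt`, the prompt wording, and the `max_emails` cutoff are all hypothetical, and the call to the model itself is left out.

```python
def build_prompt(question: str, rows: list[dict], max_emails: int = 5) -> str:
    """Combine the closest emails and the new email into one prompt string."""
    # Rows are assumed to be sorted by distance already (ORDER BY distance ASC).
    context = "\n".join(
        f"[{i}] {row['text']}" for i, row in enumerate(rows[:max_emails], start=1)
    )
    return (
        "Use the following past emails as context.\n"
        f"{context}\n\n"
        f"New email: {question}"
    )
```

Keeping only the closest few emails keeps the prompt short, at the cost of dropping weaker matches from the 100 returned rows.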


### Want to learn more?

* [Ingesting data](https://betterstack.com/docs/warehouse/ingesting-data/)
* [Built-in embeddings](https://betterstack.com/docs/warehouse/vector-embeddings/built-in-embeddings/)
* [Data structures and indices in embeddings](https://betterstack.com/docs/warehouse/vector-embeddings/data-structures-and-indices/)
* [Querying data](https://betterstack.com/docs/warehouse/querying-data/)
* [Saved queries as APIs](https://betterstack.com/docs/warehouse/querying-data/queries/)
* [Controlling costs](https://betterstack.com/docs/warehouse/management/controlling-costs/)
