A note before we begin: whichever deployment method you eventually choose, make sure to secure your API keys. Do not expose your OpenAI API key (or any Better Stack token) publicly. Use environment variables or the secret management features of your hosting platform to keep credentials safe.
Introduction
Retrieval-Augmented Generation (RAG) is a technique that enables large language models (LLMs) to retrieve and incorporate new information from outside their training data. By pulling in relevant external documents or data when answering queries, RAG helps LLMs provide up-to-date, factual answers using domain-specific sources rather than relying on stale training data alone. This approach not only reduces AI hallucinations (made-up facts) but also means we don’t have to constantly retrain the model on new information.
In this tutorial, we’ll build a RAG application from scratch using Python. Our app will integrate a public status page API as a live data source and OpenAI’s GPT model for generation. Imagine a chatbot that can answer questions like “What’s the latest system incident?” or “How is Service X doing today?” by retrieving real-time status updates and feeding them into an LLM. We will walk through every step:
- Fetching data from the status page API.
- Creating a vector store of embeddings for efficient similarity search.
- Prompting GPT with the retrieved context.
- Building an interactive interface to ask questions.
By the end, you will have a working Python RAG application that augments GPT with live status data. This guide is written in a clear, step-by-step format, with code snippets and explanations for each part of the process.
Let’s dive in!
Prerequisites
Before we start coding, make sure you have the following:
- Python 3.12+ – installed and up to date. The code in this tutorial targets a recent Python release.
- OpenAI API Key – with access to GPT-4 or similar models. Sign up at OpenAI and create an API key, then set it as the OPENAI_API_KEY environment variable.
- A public Better Stack status page – We will use https://status.betterstack.com in this article, but feel free to use your own!
- Basic familiarity with Python is helpful. No prior experience with RAG is required – we’ll explain each concept as we implement it.
- Libraries/Dependencies – We’ll use the openai Python client, requests for HTTP calls, and streamlit for the UI. We’ll also use faiss-cpu (Facebook AI Similarity Search) for vector indexing, plus numpy for array handling.
Project Setup
First, let's create a proper project structure and set up our environment:
# Create project directory
mkdir status-page-rag-app
cd status-page-rag-app
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies using pip
pip install openai requests streamlit faiss-cpu numpy python-dotenv
pip freeze > requirements.txt
# Create empty app.py and .env files
touch app.py
touch .env
Add your OpenAI API key to the .env file:
OPENAI_API_KEY=your_api_key_here
Now we're all set and can start building our application! 🚀
Getting Started
Let’s set up our project and verify access to the APIs. Start by importing the necessary modules and initializing the OpenAI client. We will use the OpenAI v1 client interface, which involves creating a client object rather than using module-level functions.
To keep things simple, we will be building the whole application in a single app.py file. Everything should work nicely by just appending the code block from each chapter.
import os
from dotenv import load_dotenv
from openai import OpenAI, AuthenticationError

# Load required credentials from environment variables
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise Exception("Please set the OPENAI_API_KEY environment variable.")

# Initialize OpenAI API client
try:
    client = OpenAI(api_key=OPENAI_API_KEY)
except AuthenticationError:
    raise Exception("Failed to authenticate with OpenAI API. Check your API key in the OPENAI_API_KEY environment variable.")
In the code above, we retrieve the OpenAI API key from an environment variable and ensure it’s present. We then instantiate an OpenAI client with that API key. If the key is missing or invalid, an exception is raised so that you know to set or correct the API key before proceeding. By using OpenAI(api_key=...) to create a client, we can later call methods like client.embeddings.create() and client.chat.completions.create() to interact with the API.
It's best to test each part of the application to make sure everything is working as expected. Add the following code temporarily:
# ...

print("Authenticated with OpenAI API. Available models:")
for model in client.models.list().data:
    print(f"- {model.id}")
Then, run the script:
python app.py
When successful, you will see output that looks like:
Authenticated with OpenAI API. Available models:
- gpt-4
- gpt-4-turbo
- gpt-3.5-turbo
- text-embedding-3-small
- text-embedding-3-large
- dall-e-3
- whisper-1
Now, with our environment set up and the OpenAI client ready, we can move on to understanding the RAG architecture we’ll implement.
RAG Concepts
A RAG application consists of two main parts: retrieval of relevant context and generation of answers using that context. In practice, building a RAG pipeline involves a few key steps:
- Indexing (Embedding the Data): First, gather the external data that the LLM should know about – in our case, status page updates. We then convert this text data into vector embeddings using an embedding model. These embedding vectors capture the semantic meaning of the text in a high-dimensional space.
- Retrieval (Similarity Search): When a user asks a question, we embed the query in the same vector space and search for the most similar embeddings among our data. This lets us retrieve the most relevant pieces of text (e.g. recent incident reports or status messages) that might contain the answer. We use a vector index for efficient similarity search, so this step is fast even if we have many data points.
- Generation (Augmented Answering): Finally, we feed the retrieved context along with the user’s question into the LLM (GPT) and ask it to formulate an answer. By providing the LLM with relevant up-to-date information from the status API, we ground its response in real data. The GPT model will incorporate the provided status details when generating the answer, resulting in a more accurate and context-aware response.
In simpler terms, our app will fetch the latest status page data (such as incident titles or updates), store it in an embedding-based knowledge base, and for each query, find the best matches from that knowledge base to help GPT answer correctly. This means if the status page says “Service A is currently down due to network issues”, and a user asks “Is Service A operational?”, the app can retrieve that update and GPT will respond with something like “Service A is experiencing downtime due to network issues as per the latest status update.” The power of RAG is that the LLM’s answer is augmented with real-time data rather than just its training memory.
With the concept clear, let’s start coding the retrieval pipeline by fetching data from the status API.
Fetching Data from the Status API
Our first task is to retrieve live data from the status page. The following code defines a function to fetch the latest status reports via the API. We use Python’s requests
library to make an HTTP GET request. In our case, we’ll call the Better Stack public status page endpoint /index.json
, which does not require any authentication. We include basic error handling to catch network issues or non-200 HTTP responses, and then parse the JSON data to extract the relevant pieces of information (such as incident titles and updates).
import requests

def fetch_status_data():
    # JSON endpoint of Better Stack status page (no auth required)
    # Feel free to replace it by your own status page
    url = "https://status.betterstack.com/index.json"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        data = response.json()
    except Exception as e:
        print(f"Error fetching data: {e}")
        return []

    # Extract all included reports, updates, and resource names
    included = data.get("included", [])
    reports = [i for i in included if i["type"] == "status_report"]
    updates = {i["id"]: i for i in included if i["type"] == "status_update"}
    resources = {
        r["id"]: r["attributes"]["public_name"]
        for r in included if r["type"] == "status_page_resource"
    }

    # Build status texts for each report
    status_texts = []
    for report in reports:
        a = report["attributes"]
        title = a.get("title", "Untitled")
        state = a.get("aggregate_state", "unknown")
        rtype = a.get("report_type", "manual")
        affected = [r["status_page_resource_id"] for r in a.get("affected_resources", [])]
        affected_names = [resources.get(rid, f"resource {rid}") for rid in affected]

        # Format all related updates into messages
        updates_list = report.get("relationships", {}).get("status_updates", {}).get("data", [])
        messages = []
        for ref in updates_list:
            u = updates.get(ref["id"])
            if u:
                ts = u["attributes"].get("published_at", "")
                msg = u["attributes"].get("message", "").strip()
                messages.append(f"Update at {ts}:\n{msg}")

        # Format all relevant information into a single text
        text = f"Incident: {title}\nState: {state}\n"
        if rtype == "maintenance":
            start = a.get("starts_at", "unknown")
            end = a.get("ends_at", "unknown")
            text += f"Time: {start} - {end}\n"
        text += f"Affected: {', '.join(affected_names) or 'unknown'}\n\n"
        text += "\n\n".join(messages)
        status_texts.append(text)

    return status_texts
A few things to note in this snippet:
- We fetch JSON from the Better Stack public status page.
- We combine the incident title, state, affected services, and all updates into one readable text block. This format ensures no context is lost and is easy for the model to understand.
- Only maintenance reports include their start/end times; regular incidents contain the full timeline of status updates.
- These complete report summaries are collected into status_texts, ready for indexing.
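If you're curious what the raw payload behind this parser looks like, a quick throwaway snippet can dump the start of it. It hits the same /index.json endpoint used above and is not part of the app itself:
# Throwaway snippet: dump the beginning of the raw JSON payload to inspect its structure
import json
import requests

response = requests.get("https://status.betterstack.com/index.json", timeout=10)
print(json.dumps(response.json(), indent=2)[:2000])  # print only the first ~2,000 characters
You can also verify the parser itself by temporarily appending a small test block: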
# ...

print("Fetching status data...")
texts = fetch_status_data()
print(f"Fetched {len(texts)} status reports.")
if texts:
    print(f"Latest status report:\n{texts[-1]}")
Run the script:
python app.py
You should see output that looks like:
Fetching status data...
Fetched 3 status reports.
Latest status report:
Incident: Database Performance Issues
State: resolved
Affected: API, Dashboard
Update at 2024-01-15T10:30:00Z:
We are investigating reports of slow database queries affecting our API response times.
Now that we can fetch the status data, let’s create the vector store (index) for our retrieved texts.
Embedding and Indexing
With status_texts (the list of formatted status reports) in hand, the next step is to embed these texts into vectors and build a vector index. We will use OpenAI’s text embedding model (text-embedding-3-small) to convert each piece of text into a 1536-dimensional embedding vector. These embeddings numerically represent the semantic content of the text, enabling similarity comparisons.
If you’re just getting started with AI, this step may sound overly technical and academic. Think of an embedding as converting text into a special kind of number that tells a computer what the text means. It's like turning a sentence into a dot in space — dots with similar meanings sit close together. So later, when someone asks a question, the computer can look for the closest matching dots (texts), even if the words are different, because the meanings are nearby.
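To make the "dots in space" idea concrete, here is a small optional experiment you can run in a Python shell. It assumes the client object created in the Getting Started section and uses the same embedding model we'll rely on below:
# Optional experiment: similar sentences produce nearby vectors
import numpy as np

resp = client.embeddings.create(
    input=["The API is down", "The API is unavailable", "The weather is sunny"],
    model="text-embedding-3-small",
)
vecs = [np.array(d.embedding, dtype="float32") for d in resp.data]
print(len(vecs[0]))                       # 1536 dimensions per vector
print(np.linalg.norm(vecs[0] - vecs[1]))  # small distance: similar meaning
print(np.linalg.norm(vecs[0] - vecs[2]))  # larger distance: different meaning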
We’ll then use FAISS (Facebook AI Similarity Search) to create an index for fast nearest-neighbor search among those vectors. FAISS allows us to efficiently find which stored vectors are closest to a given query vector – exactly what we need for retrieval. If you installed faiss-cpu as suggested, we can proceed to use it.
Let’s write a function to build the index from a list of status texts. This function will call the OpenAI embeddings API and handle edge cases like empty inputs or API errors. We’ll also implement some best practices: only (re)build the index when needed (to avoid redundant computation) and handle empty embedding results gracefully.
import numpy as np
import faiss

EMBEDDING_MODEL = "text-embedding-3-small"

def build_index_from_texts(texts):
    """Generate embeddings for the given texts and build a FAISS index. Returns the index or None."""
    if not texts:
        # No texts to index; return None (clear any existing index)
        return None

    # Request embeddings from OpenAI
    try:
        embed_response = client.embeddings.create(input=texts, model=EMBEDDING_MODEL)
    except Exception as e:
        print(f"OpenAI Embedding API error: {e}")
        return None

    # Safety check: ensure we got embeddings for the texts
    if not embed_response.data:
        print("Warning: No embeddings returned for the input texts.")
        return None

    # Extract embedding vectors from the response
    embeddings = [record.embedding for record in embed_response.data]

    # Build a FAISS index (L2 distance for similarity search)
    dimension = len(embeddings[0])
    index = faiss.IndexFlatL2(dimension)
    index.add(np.array(embeddings, dtype='float32'))
    return index
In this code:
- We define a constant EMBEDDING_MODEL for clarity. We’re using OpenAI’s text-embedding-3-small model, which is well-suited for semantic search.
- The function build_index_from_texts first checks if texts is empty. If so, it immediately returns None. This way, we avoid building an index when there’s no data (and we can use None to indicate “no index” when data is absent).
- We call the OpenAI embeddings API (client.embeddings.create) with the list of texts. Using the OpenAI v1 client, this returns a response object containing a .data list of embedding results. We wrap the call in a try/except to catch any API errors (like rate limits or authentication issues) and print an error message instead of crashing.
- After the API call, we check if not embed_response.data:. This is the graceful handling of empty embeddings — in the unlikely event the API returns an empty result, we log a warning and return None. This ensures our code doesn’t proceed with no data.
- If we have embeddings, we extract them into a Python list. Each record in embed_response.data has an .embedding attribute that is a list of floats (the vector). We use a list comprehension to collect all embedding vectors.
- We then initialize a FAISS IndexFlatL2 with the embedding dimension. IndexFlatL2 is a simple index that computes L2 (Euclidean) distances — fine for moderate-sized data. We convert our list of embeddings to a NumPy array of type float32 (required by FAISS) and add it to the index.
- Finally, we return the built index object. If at some point we want to update the data, we can rebuild or update this index with new embeddings.
At this point, we have the capability to build (or rebuild) our vector index from the latest status texts. We will make sure to only build the index when necessary. In practice, this means we’ll call build_index_from_texts when we first fetch data or if our index is currently None. If we already have an index and the data hasn’t changed, we can reuse it without rebuilding — saving time and API calls. We’ll see this in action in the Streamlit integration.
GPT Integration
Now comes the generation part of our RAG app. We need to accept a user’s question, retrieve relevant context from our indexed data, and then query the GPT model to generate an answer using that context.
The plan is as follows:
- Embed the user’s query into the same vector space as our data.
- Search the FAISS index to find the most similar texts from status_texts to the query.
- Construct a prompt for GPT that includes these retrieved texts as context.
- Call the OpenAI Chat Completions API to get an answer.
We’ll write a function answer_query that performs these steps. This function will need access to our index and the original texts list (to map indices back to text). We’ll use a system message to instruct the assistant to use the provided context, and a user message that includes the actual question along with the retrieved context.
def answer_query(query, index, texts, model="gpt-4", k=3):
    """Given a user query and our knowledge index, retrieve relevant data and get an answer from GPT."""
    if index is None or not texts:
        return "I don't have any status data to answer that question."

    # 1. Embed the user query
    try:
        query_response = client.embeddings.create(input=[query], model=EMBEDDING_MODEL)
    except Exception as e:
        print(f"Error embedding query: {e}")
        return "Sorry, I couldn't process your question at this time."
    query_vector = np.array(query_response.data[0].embedding, dtype='float32')

    # 2. Retrieve top-k similar texts from the index
    distances, indices = index.search(query_vector.reshape(1, -1), k)
    retrieved_texts = [texts[i] for i in indices[0] if 0 <= i < len(texts)]

    # 3. Construct the prompt with context
    context = "\n".join(retrieved_texts)
    system_msg = {
        "role": "system",
        "content": "You are a helpful assistant answering questions about a system's status. Use the provided status updates to give an accurate answer."
    }
    user_msg = {
        "role": "user",
        "content": f"Here are some relevant status updates:\n{context}\n\nQuestion: {query}"
    }

    # 4. Generate answer using OpenAI ChatCompletion
    try:
        chat_response = client.chat.completions.create(
            model=model,
            messages=[system_msg, user_msg],
            temperature=0
        )
    except Exception as e:
        print(f"OpenAI ChatCompletion API error: {e}")
        return "Error: Unable to get a response from the language model."

    answer = chat_response.choices[0].message.content.strip()
    return answer
Let’s break down how answer_query works:
- Check for data availability: If the index is None or we have no texts, the function immediately returns a message indicating that there is no status data to answer the question. This prevents attempting a search or GPT query when we have no context.
- Embed the query: We use client.embeddings.create again, this time with the user’s query as input. The result query_response should contain one embedding accessible as query_response.data[0]. We convert that embedding into a NumPy vector query_vector. This step is also wrapped in try/except to handle any errors in embedding the query.
- Retrieve relevant texts: We use the FAISS index’s search method to find the nearest neighbors to the query vector. We ask for the top 3 matches (k=3). index.search returns a tuple of (distances, indices). We take the indices of the results and map them back to the original texts list to get the actual status update texts that are most relevant to the query.
- Construct the prompt: We then prepare the messages for GPT. We join the retrieved texts with newline characters to form a context string. We create a system message to prime the assistant with instructions (telling it to use the status updates to answer accurately), and a user message that contains the actual question prefixed by the retrieved status updates. By structuring the prompt this way, we give GPT the relevant information to use when formulating its answer.
- Generate the answer: Finally, we call the OpenAI Chat Completion API (client.chat.completions.create) with our messages. We set temperature=0 for a deterministic answer. If the API call succeeds, we extract the assistant’s reply (chat_response.choices[0].message.content) and return it. If there’s an exception (e.g., API error or timeout), we catch it and return an error message string.
As before, let’s verify the function with a temporary test:
# ...
question = "Is there any maintenance planned for next month?"
print(f"Q: {question}")
texts = fetch_status_data()
answer = answer_query(question, build_index_from_texts(texts), texts)
print(f"A: {answer}")
Run the script:
python app.py
You should see output that looks like:
Q: Is there any maintenance planned for next month?
A: Based on the provided status updates, there is no information about any maintenance planned for the next month. The most recent incident updates do not mention any upcoming maintenance activities. If you have any specific concerns or questions, you can reach out to the support team at hello@betterstack.com for further assistance.
With this function, our backend logic for Q&A is complete: we can take a question, find context, and get an answer from GPT. The next step is to integrate everything into an interactive interface.
Streamlit Interface
At this point, we have all the core pieces of our application: the ability to fetch data, build and search the vector index, and get GPT to answer using retrieved context. Now, let's tie it all together in a Streamlit interface for interactivity.
Building the Streamlit Interface: Streamlit makes it straightforward to create a simple web interface for our app. We will use Streamlit to allow the user to input questions and to display the answers. We will also utilize st.session_state to store our index and data between interactions, so we don’t rebuild everything on every question. Additionally, we’ll implement a refresh interval for the data (every 5 minutes) to keep our context updated.
Here’s the code for the Streamlit app. It initializes the state, fetches data when needed, and defines the UI components:
import streamlit as st
from datetime import datetime, timezone

# Initialize session state for persistence
if "index" not in st.session_state:
    st.session_state.index = None
    st.session_state.status_texts = []
    st.session_state.last_fetch = None

# Refresh data every 5 minutes
NOW = datetime.now(timezone.utc)
if st.session_state.last_fetch is None or (NOW - st.session_state.last_fetch).total_seconds() > 300:
    # Fetch latest status data
    texts = fetch_status_data()
    if not texts:
        # If no data fetched, clear the index
        st.session_state.index = None
        st.session_state.status_texts = []
    else:
        # Only rebuild index if we don't have one yet
        if st.session_state.index is None:
            st.session_state.index = build_index_from_texts(texts)
            st.session_state.status_texts = texts
    st.session_state.last_fetch = NOW

# Streamlit UI components
st.title("🔍 Live Status Q&A Chatbot")
query = st.text_input("Ask a question about the system status:")
if st.button("Get Answer"):
    if not query.strip():
        st.warning("Please enter a question.")
    else:
        answer = answer_query(query, st.session_state.index, st.session_state.status_texts)
        st.write(f"**Answer:** {answer}")
Let’s walk through what this does:
- Session state initialization: We use st.session_state to store index, status_texts, and last_fetch across runs (user interactions). On first load, these keys won’t exist, so we initialize them: set index to None, status_texts to an empty list, and last_fetch to None.
- Periodic data refresh: We then check the current time (NOW) against st.session_state.last_fetch. If we’ve never fetched data before, or if more than 300 seconds (5 minutes) have passed since the last fetch, we proceed to refresh the data. In a more advanced implementation, you might detect if new texts were added and update or rebuild accordingly. For our basic app, we assume changes are minimal or require a restart to incorporate fully; see the optional refresh button sketched after this list if you want to force an update.
- UI setup: Next, we set up the interface. st.title displays the app title at the top. st.text_input creates a text box for the user’s question with the prompt "Ask a question about the system status:". The variable query will hold whatever the user types.
- Answering questions: We use st.button("Get Answer") to render a button. When the button is clicked, this returns True and triggers our logic to generate an answer by calling our answer_query function with the current index and status_texts from the session state.
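Because the data only refreshes every five minutes, you may occasionally want to force an update while testing. The optional sidebar button sketched below (an illustrative extra, not part of the core app) reuses the fetch and index functions from earlier:
# Optional: a manual refresh button in the sidebar (illustrative extra, not required)
if st.sidebar.button("Refresh status data now"):
    texts = fetch_status_data()
    st.session_state.status_texts = texts
    st.session_state.index = build_index_from_texts(texts)
    st.session_state.last_fetch = datetime.now(timezone.utc)
    st.sidebar.success(f"Re-indexed {len(texts)} status reports.")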
With this interface, users can type in questions like “Were there any incidents today?” and receive answers sourced from the status page data. Our app automatically refreshes the data every 5 minutes, so if a new incident is published on the status page, it will be picked up and used in answers.
Deployment
Running the app locally is as simple as executing the Streamlit command in your terminal. From your project directory, run:
streamlit run app.py
This will start a local web server (usually at http://localhost:8501) where you can interact with the Q&A interface in your browser.
To deploy the application for others to use, you have a few options:
- Streamlit Community Cloud: Streamlit provides a free cloud service to deploy apps directly from a GitHub repo. You can push your code to a repository and share your app through Streamlit sharing. Just remember to add your API keys as secrets on the platform (never hard-code them in a public repo); see the sketch after this list for one way to read them.
- Self-host on a VM or Docker: You could run the Streamlit app on a cloud VM or inside a Docker container. Ensure you securely provide the required environment variables. For a Docker deployment, prefer supplying the environment variables at runtime rather than baking secrets into the image, and then run streamlit run as the container’s entrypoint.
- Heroku or Other PaaS: Streamlit apps can often be deployed on platforms like Heroku with a few configuration tweaks. Similarly, you could use AWS EC2, Google Cloud Run, or Azure App Service. The key is to ensure your Python environment has all the dependencies and the app is launched with the proper Streamlit command.
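For example, on Streamlit Community Cloud the usual approach is to store the key in the platform's secret store. A minimal sketch of reading it from st.secrets with a fallback to the local .env file could look like this (illustrative; adapt it to whatever platform you deploy on):
# Illustrative sketch: prefer the platform's secret store, fall back to the local .env file
import os
import streamlit as st
from dotenv import load_dotenv

load_dotenv()
try:
    OPENAI_API_KEY = st.secrets["OPENAI_API_KEY"]  # e.g. set via Streamlit Community Cloud secrets
except Exception:
    # st.secrets may raise if no secrets file exists locally, so fall back to the environment
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")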
Once deployed, you and your team can visit the app’s URL, ask questions about your system status, and get real-time answers. This can be incredibly useful for support teams or on-call engineers who need to quickly query the latest status of various services via a conversational interface.
Final Thoughts
In this tutorial, we built a fully functional Retrieval-Augmented Generation application step by step. We started by explaining what RAG is and why it’s useful for injecting live data into LLM responses. Using a public status page API, we fetched real-time status updates and embedded them into vectors so that our GPT model could “understand” and use that information. We implemented a simple yet effective similarity search with FAISS to retrieve relevant context for any given user query. Finally, we integrated OpenAI’s Chat Completion API to generate answers that are grounded in the latest data, and we put a user-friendly interface on top with Streamlit.
This RAG approach can be extended beyond status pages. You could integrate other data sources - such as documents, knowledge bases, monitoring alerts, or your logs and metrics available via Better Stack APIs - by indexing their content in the same way. The core pattern remains: retrieve, then generate. For production use, you might consider more sophisticated vector database solutions, better caching of embeddings, or streaming responses for long answers. You might also implement follow-up questions or conversation memory by extending the prompt with previous Q&A turns.
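As a taste of that last idea, here is a minimal sketch of conversation memory. It reuses the client, EMBEDDING_MODEL, and np names defined earlier, omits the error handling from answer_query for brevity, and simply prepends the last few question/answer turns to the messages list:
# Minimal sketch of conversation memory (reuses client, EMBEDDING_MODEL, and np from earlier;
# error handling from answer_query omitted for brevity)
def answer_query_with_history(query, index, texts, history, model="gpt-4", k=3):
    """Like answer_query, but includes previous (question, answer) turns in the prompt."""
    if index is None or not texts:
        return "I don't have any status data to answer that question."

    # Embed the query and retrieve the most relevant status texts
    query_embedding = client.embeddings.create(input=[query], model=EMBEDDING_MODEL).data[0].embedding
    query_vector = np.array(query_embedding, dtype="float32").reshape(1, -1)
    _, indices = index.search(query_vector, k)
    context = "\n".join(texts[i] for i in indices[0] if 0 <= i < len(texts))

    # Prepend the last few turns so the model can resolve follow-up questions
    messages = [{"role": "system", "content": "You are a helpful assistant answering questions about a system's status. Use the provided status updates to give an accurate answer."}]
    for past_question, past_answer in history[-3:]:
        messages.append({"role": "user", "content": past_question})
        messages.append({"role": "assistant", "content": past_answer})
    messages.append({"role": "user", "content": f"Here are some relevant status updates:\n{context}\n\nQuestion: {query}"})

    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    answer = response.choices[0].message.content.strip()
    history.append((query, answer))
    return answer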
Feel free to experiment with the code: ask different questions, tweak the number of retrieved documents, the format of retrieved data, or switch to a different model for potentially better answers. With a working RAG application at your fingertips, you have a foundation to build smarter, data-informed AI assistants that keep your users up-to-date with the latest information.
Happy coding! 🚀