A note before we begin: whichever deployment method you eventually choose, make sure to secure your API keys. Do not expose your OpenAI API key (or any Better Stack token) publicly. Use environment variables or the secret management features of your hosting platform to keep credentials safe.
Introduction
Retrieval-Augmented Generation (RAG) is a technique that enables large language models (LLMs) to retrieve and incorporate new information from outside their training data. By pulling in relevant external documents or data when answering queries, RAG helps LLMs provide up-to-date, factual answers using domain-specific sources rather than relying on stale training data alone. This approach not only reduces AI hallucinations (made-up facts) but also means we don’t have to constantly retrain the model on new information.
In this tutorial, we’ll build a RAG application from scratch using Python. Our app will integrate a public status page API as a live data source and OpenAI’s GPT model for generation. Imagine a chatbot that can answer questions like “What’s the latest system incident?” or “How is Service X doing today?” by retrieving real-time status updates and feeding them into an LLM. We will walk through every step:
- Fetching data from the status page API.
- Creating a vector store of embeddings for efficient similarity search.
- Prompting GPT with the retrieved context.
- Building an interactive interface to ask questions.
By the end, you will have a working Python RAG application that augments GPT with live status data. This guide is written in a clear, step-by-step format, with code snippets and explanations for each part of the process.
Let’s dive in!
Prerequisites
Before we start coding, make sure you have the following:
- Python 3.12+ – installed and up to date. The code in this tutorial targets a recent Python release.
- OpenAI API Key – with access to GPT-4 or similar models. Sign up at OpenAI and create an API key, then set it as the OPENAI_API_KEY environment variable.
- A public Better Stack status page – We will use https://status.betterstack.com in this article, but feel free to use your own!
- Basic familiarity with Python is helpful. No prior experience with RAG is required – we’ll explain each concept as we implement it.
- Libraries/Dependencies – We’ll use the openai Python client, requests for HTTP calls, and streamlit for the UI. We’ll also use faiss-cpu (Facebook AI Similarity Search) for vector indexing, plus numpy for array handling.
Project Setup
First, let's create a proper project structure and set up our environment:
# Create project directory
mkdir status-page-rag-app
cd status-page-rag-app
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies using pip
pip install openai requests streamlit faiss-cpu numpy python-dotenv
pip freeze > requirements.txt
# Create empty app.py and .env files
touch app.py
touch .env
Add your OpenAI API key to the .env file:
OPENAI_API_KEY=your_api_key_here
Now we're all set and can start building our application! 🚀
Getting Started
Let’s set up our project and verify access to the APIs. Start by importing the necessary modules and initializing the OpenAI client. We will use the OpenAI v1 client interface, which involves creating a client object rather than using module-level functions.
To keep things simple, we will be building the whole application in a single app.py file. Everything should work nicely by just appending the code block from each chapter.
import os
from dotenv import load_dotenv
from openai import OpenAI, AuthenticationError

# Load required credentials from environment variables
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise Exception("Please set the OPENAI_API_KEY environment variable.")

# Initialize OpenAI API client
try:
    client = OpenAI(api_key=OPENAI_API_KEY)
except AuthenticationError:
    raise Exception("Failed to authenticate with OpenAI API. Check your API key in the OPENAI_API_KEY environment variable.")
In the code above, we retrieve the OpenAI API key from an environment variable and ensure it’s present. We then instantiate an OpenAI client with that API key. If the key is missing or invalid, an exception is raised so that you know to set or correct the API key before proceeding. By using OpenAI(api_key=...) to create a client, we can later call methods like client.embeddings.create() and client.chat.completions.create() to interact with the API.
It's best to test each part of the application to make sure everything is working as expected. Add the following code temporarily:
# ...

print("Authenticated with OpenAI API. Available models:")
for model in client.models.list().data:
    print(f"- {model.id}")
Then, run the script:
python app.py
When successful, you will see output that looks like:
Authenticated with OpenAI API. Available models:
- gpt-4
- gpt-4-turbo
- gpt-3.5-turbo
- text-embedding-3-small
- text-embedding-3-large
- dall-e-3
- whisper-1
Now, with our environment set up and the OpenAI client ready, we can move on to understanding the RAG architecture we’ll implement.
RAG Concepts
A RAG application consists of two main parts: retrieval of relevant context and generation of answers using that context. In practice, building a RAG pipeline involves a few key steps:
- Indexing (Embedding the Data): First, gather the external data that the LLM should know about – in our case, status page updates. We then convert this text data into vector embeddings using an embedding model. These embedding vectors capture the semantic meaning of the text in a high-dimensional space.
- Retrieval (Similarity Search): When a user asks a question, we embed the query in the same vector space and search for the most similar embeddings among our data. This lets us retrieve the most relevant pieces of text (e.g. recent incident reports or status messages) that might contain the answer. We use a vector index for efficient similarity search, so this step is fast even if we have many data points.
- Generation (Augmented Answering): Finally, we feed the retrieved context along with the user’s question into the LLM (GPT) and ask it to formulate an answer. By providing the LLM with relevant up-to-date information from the status API, we ground its response in real data. The GPT model will incorporate the provided status details when generating the answer, resulting in a more accurate and context-aware response.
In simpler terms, our app will fetch the latest status page data (such as incident titles or updates), store it in an embedding-based knowledge base, and for each query, find the best matches from that knowledge base to help GPT answer correctly. This means if the status page says “Service A is currently down due to network issues”, and a user asks “Is Service A operational?”, the app can retrieve that update and GPT will respond with something like “Service A is experiencing downtime due to network issues as per the latest status update.” The power of RAG is that the LLM’s answer is augmented with real-time data rather than just its training memory.
With the concept clear, let’s start coding the retrieval pipeline by fetching data from the status API.
Fetching Data from the Status API
Our first task is to retrieve live data from the status page. The following code defines a function to fetch the latest status reports via the API. We use Python’s requests
library to make an HTTP GET request. In our case, we’ll call the Better Stack public status page endpoint /index.json
, which does not require any authentication. We include basic error handling to catch network issues or non-200 HTTP responses, and then parse the JSON data to extract the relevant pieces of information (such as incident titles and updates).
import requests

def fetch_status_data():
    # JSON endpoint of Better Stack status page (no auth required)
    # Feel free to replace it by your own status page
    url = "https://status.betterstack.com/index.json"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        data = response.json()
    except Exception as e:
        print(f"Error fetching data: {e}")
        return []

    # Extract all included reports, updates, and resource names
    included = data.get("included", [])
    reports = [i for i in included if i["type"] == "status_report"]
    updates = {i["id"]: i for i in included if i["type"] == "status_update"}
    resources = {
        r["id"]: r["attributes"]["public_name"]
        for r in included if r["type"] == "status_page_resource"
    }

    # Build status texts for each report
    status_texts = []
    for report in reports:
        a = report["attributes"]
        title = a.get("title", "Untitled")
        state = a.get("aggregate_state", "unknown")
        rtype = a.get("report_type", "manual")
        affected = [r["status_page_resource_id"] for r in a.get("affected_resources", [])]
        affected_names = [resources.get(rid, f"resource {rid}") for rid in affected]

        # Format all related updates into messages
        updates_list = report.get("relationships", {}).get("status_updates", {}).get("data", [])
        messages = []
        for ref in updates_list:
            u = updates.get(ref["id"])
            if u:
                ts = u["attributes"].get("published_at", "")
                msg = u["attributes"].get("message", "").strip()
                messages.append(f"Update at {ts}:\n{msg}")

        # Format all relevant information into a single text
        text = f"Incident: {title}\nState: {state}\n"
        if rtype == "maintenance":
            start = a.get("starts_at", "unknown")
            end = a.get("ends_at", "unknown")
            text += f"Time: {start} - {end}\n"
        text += f"Affected: {', '.join(affected_names) or 'unknown'}\n\n"
        text += "\n\n".join(messages)
        status_texts.append(text)

    return status_texts
A few things to note in this snippet:
- We fetch JSON from the Better Stack public status page.
- We combine the incident title, state, affected services, and all updates into one readable text block. This format ensures no context is lost and is easy for the model to understand.
- Only maintenance reports include their start/end times; regular incidents contain the full timeline of status updates.
- These complete report summaries are collected into status_texts, ready for indexing.
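If you're curious what the raw payload behind this parser looks like, a quick throwaway snippet can dump the start of it. It hits the same /index.json endpoint used above and is not part of the app itself:
# Throwaway snippet: dump the beginning of the raw JSON payload to inspect its structure
import json
import requests

response = requests.get("https://status.betterstack.com/index.json", timeout=10)
print(json.dumps(response.json(), indent=2)[:2000])  # print only the first ~2,000 characters
You can also verify the parser itself by temporarily appending a small test block: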
# ...

print("Fetching status data...")
texts = fetch_status_data()
print(f"Fetched {len(texts)} status reports.")
if texts:
    print(f"Latest status report:\n{texts[-1]}")
Run the script:
python app.py
You should see output that looks like:
Fetching status data...
Fetched 3 status reports.
Latest status report:
Incident: Database Performance Issues
State: resolved
Affected: API, Dashboard
Update at 2024-01-15T10:30:00Z:
We are investigating reports of slow database queries affecting our API response times.
Now that we can fetch the status data, let’s create the vector store (index) for our retrieved texts.
Embedding and Indexing
With status_texts (the list of formatted status reports) in hand, the next step is to embed these texts into vectors and build a vector index. We will use OpenAI’s text embedding model (text-embedding-3-small) to convert each piece of text into a 1536-dimensional embedding vector. These embeddings numerically represent the semantic content of the text, enabling similarity comparisons.
If you’re just getting started with AI, this step may sound overly technical and academic. Think of an embedding as converting text into a special kind of number that tells a computer what the text means. It's like turning a sentence into a dot in space — dots with similar meanings sit close together. So later, when someone asks a question, the computer can look for the closest matching dots (texts), even if the words are different, because the meanings are nearby.
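To make the "dots in space" idea concrete, here is a small optional experiment you can run in a Python shell. It assumes the client object created in the Getting Started section and uses the same embedding model we'll rely on below:
# Optional experiment: similar sentences produce nearby vectors
import numpy as np

resp = client.embeddings.create(
    input=["The API is down", "The API is unavailable", "The weather is sunny"],
    model="text-embedding-3-small",
)
vecs = [np.array(d.embedding, dtype="float32") for d in resp.data]
print(len(vecs[0]))                       # 1536 dimensions per vector
print(np.linalg.norm(vecs[0] - vecs[1]))  # small distance: similar meaning
print(np.linalg.norm(vecs[0] - vecs[2]))  # larger distance: different meaning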
We’ll then use FAISS (Facebook AI Similarity Search) to create an index for fast nearest-neighbor search among those vectors. FAISS allows us to efficiently find which stored vectors are closest to a given query vector – exactly what we need for retrieval. If you installed faiss-cpu as suggested, we can proceed to use it.
Let’s write a function to build the index from a list of status texts. This function will call the OpenAI embeddings API and handle edge cases like empty inputs or API errors. We’ll also implement some best practices: only (re)build the index when needed (to avoid redundant computation) and handle empty embedding results gracefully.
import numpy as np
import faiss

EMBEDDING_MODEL = "text-embedding-3-small"

def build_index_from_texts(texts):
    """Generate embeddings for the given texts and build a FAISS index. Returns the index or None."""
    if not texts:
        # No texts to index; return None (clear any existing index)
        return None

    # Request embeddings from OpenAI
    try:
        embed_response = client.embeddings.create(input=texts, model=EMBEDDING_MODEL)
    except Exception as e:
        print(f"OpenAI Embedding API error: {e}")
        return None

    # Safety check: ensure we got embeddings for the texts
    if not embed_response.data:
        print("Warning: No embeddings returned for the input texts.")
        return None

    # Extract embedding vectors from the response
    embeddings = [record.embedding for record in embed_response.data]

    # Build a FAISS index (L2 distance for similarity search)
    dimension = len(embeddings[0])
    index = faiss.IndexFlatL2(dimension)
    index.add(np.array(embeddings, dtype='float32'))
    return index
In this code:
- We define a constant EMBEDDING_MODEL for clarity. We’re using OpenAI’s text-embedding-3-small model, which is well-suited for semantic search.
- The function build_index_from_texts first checks if texts is empty. If so, it immediately returns None. This way, we avoid building an index when there’s no data (and we can use None to indicate “no index” when data is absent).
- We call the OpenAI embeddings API (client.embeddings.create) with the list of texts. Using the OpenAI v1 client, this returns a response object containing a .data list of embedding results. We wrap the call in a try/except to catch any API errors (like rate limits or authentication issues) and print an error message instead of crashing.
- After the API call, we check if not embed_response.data:. This is the graceful handling of empty embeddings — in the unlikely event the API returns an empty result, we log a warning and return None. This ensures our code doesn’t proceed with no data.
- If we have embeddings, we extract them into a Python list. Each record in embed_response.data has an .embedding attribute that is a list of floats (the vector). We use a list comprehension to collect all embedding vectors.
- We then initialize a FAISS IndexFlatL2 with the embedding dimension. IndexFlatL2 is a simple index that computes L2 (Euclidean) distances — fine for moderate-sized data. We convert our list of embeddings to a NumPy array of type float32 (required by FAISS) and add it to the index.
- Finally, we return the built index object. If at some point we want to update the data, we can rebuild or update this index with new embeddings.
At this point, we have the capability to build (or rebuild) our vector index from the latest status texts. We will make sure to only build the index when necessary. In practice, this means we’ll call build_index_from_texts when we first fetch data or if our index is currently None. If we already have an index and the data hasn’t changed, we can reuse it without rebuilding — saving time and API calls. We’ll see this in action in the Streamlit integration.
GPT Integration
Now comes the generation part of our RAG app. We need to accept a user’s question, retrieve relevant context from our indexed data, and then query the GPT model to generate an answer using that context.
The plan is as follows:
- Embed the user’s query into the same vector space as our data.
- Search the FAISS index to find the most similar texts from status_texts to the query.
- Construct a prompt for GPT that includes these retrieved texts as context.
- Call the OpenAI Chat Completions API to get an answer.
We’ll write a function answer_query that performs these steps. This function will need access to our index and the original texts list (to map indices back to text). We’ll use a system message to instruct the assistant to use the provided context, and a user message that includes the actual question along with the retrieved context.
def answer_query(query, index, texts, model="gpt-4", k=3):
    """Given a user query and our knowledge index, retrieve relevant data and get an answer from GPT."""
    if index is None or not texts:
        return "I don't have any status data to answer that question."

    # 1. Embed the user query
    try:
        query_response = client.embeddings.create(input=[query], model=EMBEDDING_MODEL)
    except Exception as e:
        print(f"Error embedding query: {e}")
        return "Sorry, I couldn't process your question at this time."
    query_vector = np.array(query_response.data[0].embedding, dtype='float32')

    # 2. Retrieve top-k similar texts from the index
    distances, indices = index.search(query_vector.reshape(1, -1), k)
    retrieved_texts = [texts[i] for i in indices[0] if 0 <= i < len(texts)]

    # 3. Construct the prompt with context
    context = "\n".join(retrieved_texts)
    system_msg = {
        "role": "system",
        "content": "You are a helpful assistant answering questions about a system's status. Use the provided status updates to give an accurate answer."
    }
    user_msg = {
        "role": "user",
        "content": f"Here are some relevant status updates:\n{context}\n\nQuestion: {query}"
    }

    # 4. Generate answer using OpenAI ChatCompletion
    try:
        chat_response = client.chat.completions.create(
            model=model,
            messages=[system_msg, user_msg],
            temperature=0
        )
    except Exception as e:
        print(f"OpenAI ChatCompletion API error: {e}")
        return "Error: Unable to get a response from the language model."

    answer = chat_response.choices[0].message.content.strip()
    return answer
Let’s break down how answer_query works:
- Check for data availability: If the index is None or we have no texts, the function immediately returns a message indicating that there is no status data to answer the question. This prevents attempting a search or GPT query when we have no context.
- Embed the query: We use client.embeddings.create again, this time with the user’s query as input. The result query_response should contain one embedding accessible as query_response.data[0]. We convert that embedding into a NumPy vector query_vector. This step is also wrapped in try/except to handle any errors in embedding the query.
- Retrieve relevant texts: We use the FAISS index’s search method to find the nearest neighbors to the query vector. We ask for the top 3 matches (k=3). index.search returns a tuple of (distances, indices). We take the indices of the results and map them back to the original texts list to get the actual status update texts that are most relevant to the query.
- Construct the prompt: We then prepare the messages for GPT. We join the retrieved texts with newline characters to form a context string. We create a system message to prime the assistant with instructions (telling it to use the status updates to answer accurately), and a user message that contains the actual question prefixed by the retrieved status updates. By structuring the prompt this way, we give GPT the relevant information to use when formulating its answer.
- Generate the answer: Finally, we call the OpenAI Chat Completion API (client.chat.completions.create) with our messages. We set temperature=0 for a deterministic answer. If the API call succeeds, we extract the assistant’s reply (chat_response.choices[0].message.content) and return it. If there’s an exception (e.g., API error or timeout), we catch it and return an error message string.
As before, let’s verify the function with a temporary test:
# ...
question = "Is there any maintenance planned for next month?"
print(f"Q: {question}")
texts = fetch_status_data()
answer = answer_query(question, build_index_from_texts(texts), texts)
print(f"A: {answer}")
Run the script:
python app.py
You should see output that looks like:
Q: Is there any maintenance planned for next month?
A: Based on the provided status updates, there is no information about any maintenance planned for the next month. The most recent incident updates do not mention any upcoming maintenance activities. If you have any specific concerns or questions, you can reach out to the support team at hello@betterstack.com for further assistance.
With this function, our backend logic for Q&A is complete: we can take a question, find context, and get an answer from GPT. The next step is to integrate everything into an interactive interface.
Streamlit Interface
At this point, we have all the core pieces of our application: the ability to fetch data, build and search the vector index, and get GPT to answer using retrieved context. Now, let's tie it all together in a Streamlit interface for interactivity.
Building the Streamlit Interface: Streamlit makes it straightforward to create a simple web interface for our app. We will use Streamlit to allow the user to input questions and to display the answers. We will also utilize st.session_state to store our index and data between interactions, so we don’t rebuild everything on every question. Additionally, we’ll implement a refresh interval for the data (every 5 minutes) to keep our context updated.
Here’s the code for the Streamlit app. It initializes the state, fetches data when needed, and defines the UI components:
import streamlit as st
from datetime import datetime, timezone

# Initialize session state for persistence
if "index" not in st.session_state:
    st.session_state.index = None
    st.session_state.status_texts = []
    st.session_state.last_fetch = None

# Refresh data every 5 minutes
NOW = datetime.now(timezone.utc)
if st.session_state.last_fetch is None or (NOW - st.session_state.last_fetch).total_seconds() > 300:
    # Fetch latest status data
    texts = fetch_status_data()
    if not texts:
        # If no data fetched, clear the index
        st.session_state.index = None
        st.session_state.status_texts = []
    else:
        # Only rebuild index if we don't have one yet
        if st.session_state.index is None:
            st.session_state.index = build_index_from_texts(texts)
            st.session_state.status_texts = texts
    st.session_state.last_fetch = NOW

# Streamlit UI components
st.title("🔍 Live Status Q&A Chatbot")
query = st.text_input("Ask a question about the system status:")
if st.button("Get Answer"):
    if not query.strip():
        st.warning("Please enter a question.")
    else:
        answer = answer_query(query, st.session_state.index, st.session_state.status_texts)
        st.write(f"**Answer:** {answer}")
Let’s walk through what this does:
- Session state initialization: We use st.session_state to store index, status_texts, and last_fetch across runs (user interactions). On first load, these keys won’t exist, so we initialize them: set index to None, status_texts to an empty list, and last_fetch to None.
- Periodic data refresh: We then check the current time (NOW) against st.session_state.last_fetch. If we’ve never fetched data before, or if more than 300 seconds (5 minutes) have passed since the last fetch, we proceed to refresh the data. In a more advanced implementation, you might detect if new texts were added and update or rebuild accordingly. For our basic app, we assume changes are minimal or require a restart to incorporate fully; see the optional refresh button sketched after this list if you want to force an update.
- UI setup: Next, we set up the interface. st.title displays the app title at the top. st.text_input creates a text box for the user’s question with the prompt "Ask a question about the system status:". The variable query will hold whatever the user types.
- Answering questions: We use st.button("Get Answer") to render a button. When the button is clicked, this returns True and triggers our logic to generate an answer by calling our answer_query function with the current index and status_texts from the session state.
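Because the data only refreshes every five minutes, you may occasionally want to force an update while testing. The optional sidebar button sketched below (an illustrative extra, not part of the core app) reuses the fetch and index functions from earlier:
# Optional: a manual refresh button in the sidebar (illustrative extra, not required)
if st.sidebar.button("Refresh status data now"):
    texts = fetch_status_data()
    st.session_state.status_texts = texts
    st.session_state.index = build_index_from_texts(texts)
    st.session_state.last_fetch = datetime.now(timezone.utc)
    st.sidebar.success(f"Re-indexed {len(texts)} status reports.")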
With this interface, users can type in questions like “Were there any incidents today?” and receive answers sourced from the status page data. Our app automatically refreshes the data every 5 minutes, so if a new incident is published on the status page, it will be picked up and used in answers.
Deployment
Running the app locally is as simple as executing the Streamlit command in your terminal. From your project directory, run:
streamlit run app.py
This will start a local web server (usually at http://localhost:8501) where you can interact with the Q&A interface in your browser.
To deploy the application for others to use, you have a few options:
- Streamlit Community Cloud: Streamlit provides a free cloud service to deploy apps directly from a GitHub repo. You can push your code to a repository and share your app through Streamlit sharing. Just remember to add your API keys as secrets on the platform (never hard-code them in a public repo); see the sketch after this list for one way to read them.
- Self-host on a VM or Docker: You could run the Streamlit app on a cloud VM or inside a Docker container. Ensure you securely provide the required environment variables. For a Docker deployment, prefer supplying the environment variables at runtime rather than baking secrets into the image, and then run streamlit run as the container’s entrypoint.
- Heroku or Other PaaS: Streamlit apps can often be deployed on platforms like Heroku with a few configuration tweaks. Similarly, you could use AWS EC2, Google Cloud Run, or Azure App Service. The key is to ensure your Python environment has all the dependencies and the app is launched with the proper Streamlit command.
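For example, on Streamlit Community Cloud the usual approach is to store the key in the platform's secret store. A minimal sketch of reading it from st.secrets with a fallback to the local .env file could look like this (illustrative; adapt it to whatever platform you deploy on):
# Illustrative sketch: prefer the platform's secret store, fall back to the local .env file
import os
import streamlit as st
from dotenv import load_dotenv

load_dotenv()
try:
    OPENAI_API_KEY = st.secrets["OPENAI_API_KEY"]  # e.g. set via Streamlit Community Cloud secrets
except Exception:
    # st.secrets may raise if no secrets file exists locally, so fall back to the environment
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")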
Once deployed, you and your team can visit the app’s URL, ask questions about your system status, and get real-time answers. This can be incredibly useful for support teams or on-call engineers who need to quickly query the latest status of various services via a conversational interface.
Final Thoughts
In this tutorial, we built a fully functional Retrieval-Augmented Generation application step by step. We started by explaining what RAG is and why it’s useful for injecting live data into LLM responses. Using a public status page API, we fetched real-time status updates and embedded them into vectors so that our GPT model could “understand” and use that information. We implemented a simple yet effective similarity search with FAISS to retrieve relevant context for any given user query. Finally, we integrated OpenAI’s Chat Completion API to generate answers that are grounded in the latest data, and we put a user-friendly interface on top with Streamlit.
This RAG approach can be extended beyond status pages. You could integrate other data sources - such as documents, knowledge bases, monitoring alerts, or your logs and metrics available via Better Stack APIs - by indexing their content in the same way. The core pattern remains: retrieve, then generate. For production use, you might consider more sophisticated vector database solutions, better caching of embeddings, or streaming responses for long answers. You might also implement follow-up questions or conversation memory by extending the prompt with previous Q&A turns.
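As a taste of that last idea, here is a minimal sketch of conversation memory. It reuses the client, EMBEDDING_MODEL, and np names defined earlier, omits the error handling from answer_query for brevity, and simply prepends the last few question/answer turns to the messages list:
# Minimal sketch of conversation memory (reuses client, EMBEDDING_MODEL, and np from earlier;
# error handling from answer_query omitted for brevity)
def answer_query_with_history(query, index, texts, history, model="gpt-4", k=3):
    """Like answer_query, but includes previous (question, answer) turns in the prompt."""
    if index is None or not texts:
        return "I don't have any status data to answer that question."

    # Embed the query and retrieve the most relevant status texts
    query_embedding = client.embeddings.create(input=[query], model=EMBEDDING_MODEL).data[0].embedding
    query_vector = np.array(query_embedding, dtype="float32").reshape(1, -1)
    _, indices = index.search(query_vector, k)
    context = "\n".join(texts[i] for i in indices[0] if 0 <= i < len(texts))

    # Prepend the last few turns so the model can resolve follow-up questions
    messages = [{"role": "system", "content": "You are a helpful assistant answering questions about a system's status. Use the provided status updates to give an accurate answer."}]
    for past_question, past_answer in history[-3:]:
        messages.append({"role": "user", "content": past_question})
        messages.append({"role": "assistant", "content": past_answer})
    messages.append({"role": "user", "content": f"Here are some relevant status updates:\n{context}\n\nQuestion: {query}"})

    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    answer = response.choices[0].message.content.strip()
    history.append((query, answer))
    return answer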
Feel free to experiment with the code: ask different questions, tweak the number of retrieved documents, the format of retrieved data, or switch to a different model for potentially better answers. With a working RAG application at your fingertips, you have a foundation to build smarter, data-informed AI assistants that keep your users up-to-date with the latest information.
Happy coding! 🚀