DuckDB is a fast, in-process analytical database built for modern data analysis. Unlike other databases, DuckDB runs directly inside your Python application, offering high-performance querying without needing a separate server.
Thanks to its columnar storage and vectorized execution, it's especially powerful for analytical workloads—often outperforming tools like SQLite for complex queries.
In this guide, you'll set up a Python-based data analysis workflow using DuckDB. You'll explore key features, learn how to use them effectively, and tailor your setup for performance and flexibility.
Prerequisites
To follow along, make sure you have Python 3.13 or newer installed. You should also be comfortable with basic SQL and data analysis in Python.
Getting started with DuckDB
In this section, you’ll set up a Python environment for working with DuckDB and run your first queries.
Start by creating a new project directory and navigating into it:
mkdir duckdb-analysis && cd duckdb-analysis
Then, set up a virtual environment:
python3 -m venv venv
Activate the virtual environment (the command below works on macOS and Linux; on Windows, run venv\Scripts\activate instead):
source venv/bin/activate
Now install the latest version of DuckDB. This guide works with DuckDB version 1.2.x, current at the time of writing:
pip install duckdb
Create a new analyzer.py file in the root of your project directory, and populate it with the following contents:
import duckdb
# Create a connection to an in-memory database
conn = duckdb.connect()
# Execute a simple query and fetch the result
result = conn.execute("SELECT 'Hello, world!' AS message").fetchall()
print(result)
This script creates an in-memory DuckDB connection and runs a basic SQL query. The result is returned as a list of tuples, with each tuple representing a row from the query result.
Now, let’s use the same connection in a second file called main.py:
from analyzer import conn
# Execute a simple query
result = conn.execute("SELECT 42 AS answer").fetchall()
print(result)
In this example, main.py reuses the conn object from analyzer.py to run a query. The result is fetched as a list of tuples. Separating the connection logic keeps your code clean and modular.
Once you've saved both files, execute the program using the following command:
python main.py
You should observe the following output:
[('Hello, world!',)]
[(42,)]
Each query returns a list of tuples—DuckDB’s default format when using .fetchall(). The first line comes from analyzer.py, whose top-level query runs when main.py imports it; the second comes from main.py itself. This makes it easy to work directly with small datasets in Python without needing additional libraries.
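For example, here is a minimal sketch (separate from the files above) that unpacks those tuples directly in Python:

import duckdb

conn = duckdb.connect()
# Each row comes back as a plain Python tuple, so it unpacks like any other tuple
rows = conn.execute("SELECT 'Alice' AS name, 85 AS score UNION ALL SELECT 'Bob', 90").fetchall()
for name, score in rows:
    print(f"{name} scored {score}")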
Now that you can connect to DuckDB and run queries, you're ready to work with real data.
Working with external data sources in DuckDB
DuckDB excels at querying data from external sources without loading entire datasets into memory. Unlike traditional database systems that require importing data before working with it, DuckDB can directly query files in various formats.
This "zero-copy" approach offers significant advantages:
- Reduced memory usage: DuckDB only loads the parts of the file your query needs.
- No import/export overhead: You can query the data where it already lives.
- Simplified workflows: You can run a single SQL query across multiple files or formats.
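For instance, here is a hedged sketch of the multi-file case, assuming a folder of CSV files named sales_2023.csv, sales_2024.csv, and so on with matching columns; DuckDB expands the glob pattern and unions the files for you:

import duckdb

# One query across every file matching the glob pattern (hypothetical sales_*.csv files)
result = duckdb.query("SELECT count(*) AS total_rows FROM 'sales_*.csv'").fetchall()
print(result)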
DuckDB supports many file formats, especially ones designed for analytics, like Parquet and CSV. Let’s look at how you can use these external sources.
To try this out, place a sample CSV file in your project directory. You can create one called data.csv with the following content:
id,name,score
1,Alice,85
2,Bob,90
3,Charlie,78
Now, create a new file called query_csv.py and add this code:
import duckdb
# Query data directly from a CSV file
result = duckdb.query("SELECT * FROM 'data.csv'").fetchall()
print(result)
This code runs a SQL query directly on the CSV file—no need to load the data into memory or convert it into another format.
You should see output like this:
[(1, 'Alice', 85), (2, 'Bob', 90), (3, 'Charlie', 78)]
Querying Parquet files
DuckDB also works great with Parquet, a columnar file format widely used in big data and analytics.
First, create a Parquet version of your existing CSV file to try it out.
Add the following to a new file called convert_to_parquet.py:
import duckdb
# Convert CSV to Parquet
duckdb.query("COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT 'parquet')")
In this code, you use DuckDB to read the contents of data.csv and write them to a new file called data.parquet in Parquet format.
Run the script:
python convert_to_parquet.py
This creates a data.parquet file in your project directory.
Now create a new file called query_parquet.py and add the following code:
import duckdb
# Query data directly from a Parquet file
result = duckdb.query("SELECT * FROM 'data.parquet'").fetchall()
print(result)
Then run the script:
python query_parquet.py
The output will follow the same structure—a list of tuples:
[(1, 'Alice', 85), (2, 'Bob', 90), (3, 'Charlie', 78)]
This ability to query external files directly is what makes DuckDB so powerful for data analysis. You don’t need a separate database server; you can start working with your data immediately.
Exploring SQL features with external files
Now that you can query CSV and Parquet files directly, let's explore DuckDB's SQL capabilities. DuckDB is powerful because you can apply the full range of SQL operations to external files without importing them first. This means you can filter, sort, transform, and aggregate your data with just a few lines of code.
Let's build on the previous example with data.csv, starting with a few practical queries. We'll progressively add more sophisticated SQL features to show what DuckDB can do.
Create a new file called explore_data.py. We'll start with some practical operations.
Filtering rows
When analyzing data, you often need to focus on specific subsets that meet certain criteria. DuckDB makes this easy with SQL's WHERE clause.
import duckdb
# Select only students with a score above 80
result = duckdb.query("SELECT * FROM 'data.csv' WHERE score > 80").fetchall()
print("Students with score > 80:")
print(result)
The WHERE clause acts like a filter, only returning rows that match your condition. You can use comparison operators like >, <, =, <=, >=, and <> (not equal), along with logical operators like AND, OR, and NOT to create more complex filters.
Run the file:
python explore_data.py
Students with score > 80:
[(1, 'Alice', 85), (2, 'Bob', 90)]
The output shows only the students who scored above 80, excluding those who didn’t meet the condition.
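If you need more than one condition, the logical operators compose as you'd expect. Here is a small sketch against the same data.csv:

import duckdb

# Combine conditions with AND: scores of at least 80 but below 90
result = duckdb.query("SELECT * FROM 'data.csv' WHERE score >= 80 AND score < 90").fetchall()
print(result)  # [(1, 'Alice', 85)]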
Next, let’s sort the results to see the highest scores first.
Sorting results
Data analysis often requires organizing results in a specific order. SQL's ORDER BY clause lets you sort your data based on one or more columns.
Update your file to include this sorting code:
import duckdb
# Select only students with a score above 80
result = duckdb.query("SELECT * FROM 'data.csv' WHERE score > 80").fetchall()
print("Students with score > 80:")
print(result)
# Sort students by score in descending order
sorted_result = duckdb.query("SELECT * FROM 'data.csv' ORDER BY score DESC").fetchall()
print("\nStudents sorted by score (highest first):")
print(sorted_result)
The ORDER BY clause sorts your results based on the specified column. By default, sorting is in ascending order (ASC), but you can use DESC for descending order as shown here. You can also sort by multiple columns by listing them separated by commas, like ORDER BY column1, column2 DESC.
Run the file to see the sorted results:
python explore_data.py
# Previous output...
Students sorted by score (highest first):
[(2, 'Bob', 90), (1, 'Alice', 85), (3, 'Charlie', 78)]
Adding calculated columns
SQL makes it easy to create new columns using expressions. This lets you transform your data and generate new insights directly in your query.
Update your file with the following code to add a calculated column:
import duckdb
# Previous code...
# Add 10 bonus points to each score
bonus_result = duckdb.query("""
SELECT
name,
score,
score + 10 AS boosted_score
FROM 'data.csv'
""").fetchall()
print("\nScores with 10 bonus points:")
print(bonus_result)
The AS keyword lets you name the result of an expression, creating a new column in your output. You can use arithmetic operators like +, -, *, and /, as well as many built-in functions. These calculated columns exist only in your query results and don't modify the original data file.
Rerun the file:
python explore_data.py
# Previous output...
Scores with 10 bonus points:
[('Alice', 85, 95), ('Bob', 90, 100), ('Charlie', 78, 88)]
As you can see, the query adds 10 points to each student's score and returns it as a new column called boosted_score. The original score values stay unchanged—this transformation happens only in the query result.
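Built-in functions work the same way as arithmetic in calculated columns. Here is a brief sketch, kept separate from explore_data.py so the outputs above stay unchanged:

import duckdb

# upper() and round() are built-in functions: uppercase each name and apply a 5% curve
result = duckdb.query("""
    SELECT
        upper(name) AS name_upper,
        round(score * 1.05, 1) AS curved_score
    FROM 'data.csv'
""").fetchall()
print(result)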
Next, let’s take this further and assign letter grades based on each student's score.
Using CASE expressions
For more complex transformations, SQL provides the CASE expression. This works similarly to if-else statements in programming languages, allowing you to apply conditional logic in your queries.
Add this code to your file:
import duckdb
# Previous code...
# Assign grades based on score
graded_result = duckdb.query("""
SELECT
name,
score,
CASE
WHEN score >= 90 THEN 'A'
WHEN score >= 80 THEN 'B'
ELSE 'C'
END AS grade
FROM 'data.csv'
""").fetchall()
print("\nStudents with letter grades:")
print(graded_result)
The CASE expression evaluates conditions in order and returns the result for the first true condition. In this example, we convert numeric scores to letter grades based on thresholds. This is particularly useful for categorizing data or implementing business rules in your queries.
Run the file to see the letter grades:
python explore_data.py
# Previous output...
Students with letter grades:
[('Alice', 85, 'B'), ('Bob', 90, 'A'), ('Charlie', 78, 'C')]
Now that you’ve learned how to transform and categorize your data, let’s look at how to summarize it using aggregate functions like AVG() and COUNT().
Aggregating data
SQL truly shines when it comes to summarizing data. Aggregate functions let you calculate statistics across multiple rows, giving you insights into your entire dataset or groups within it.
Add this aggregation code:
import duckdb
# Previous code...
# Calculate summary statistics
stats_result = duckdb.query("""
SELECT
COUNT(*) AS student_count,
AVG(score) AS average_score,
MIN(score) AS lowest_score,
MAX(score) AS highest_score
FROM 'data.csv'
""").fetchall()
print("\nSummary statistics:")
print(stats_result)
# Count students by grade
grade_counts = duckdb.query("""
SELECT
CASE
WHEN score >= 90 THEN 'A'
WHEN score >= 80 THEN 'B'
ELSE 'C'
END AS grade,
COUNT(*) AS count
FROM 'data.csv'
GROUP BY grade
ORDER BY grade
""").fetchall()
print("\nCount of students by grade:")
print(grade_counts)
The first query uses aggregate functions like COUNT(), AVG(), MIN(), and MAX() to summarize the entire dataset. The second query combines CASE with GROUP BY to count students in each grade category. The GROUP BY clause divides your data into groups, and the aggregate functions then operate on each group separately.
Run the file again:
python explore_data.py
# Previous output...
Summary statistics:
[(3, 84.33333333333333, 78, 90)]
Count of students by grade:
[('A', 1), ('B', 1), ('C', 1)]
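If you only want groups that meet a condition, a HAVING clause filters after the aggregation happens. A small sketch, assuming the same data.csv:

import duckdb

# Keep only grade groups whose best score is at least 85
result = duckdb.query("""
    SELECT
        CASE
            WHEN score >= 90 THEN 'A'
            WHEN score >= 80 THEN 'B'
            ELSE 'C'
        END AS grade,
        COUNT(*) AS count
    FROM 'data.csv'
    GROUP BY grade
    HAVING MAX(score) >= 85
    ORDER BY grade
""").fetchall()
print(result)  # [('A', 1), ('B', 1)]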
Now that you can work with a single external data source, let's move on to combining multiple data sources.
Working with multiple data sources
Up to this point, you’ve been working with just one data file. But in real-world analysis, it’s common to pull data from several sources. That’s where SQL joins come in—they let you combine related data easily. And with DuckDB, doing this is simple and efficient.
To try it out, let’s add another CSV file with course information that matches up with our student data. Create a new file named courses.csv and add the following content:
id,course,credits
1,Math,4
2,Science,3
3,History,3
Now, let's create a new file called join_data.py to demonstrate how to join these datasets:
import duckdb
# Query data from both CSV files and join them
join_result = duckdb.query("""
SELECT
s.id,
s.name,
c.course,
s.score,
c.credits
FROM
'data.csv' s
JOIN
'courses.csv' c ON s.id = c.id
""").fetchall()
print("Joined student and course data:")
print(join_result)
This query joins our student data with course information based on the matching id field. Note how we're using table aliases (s for students and c for courses) to make the query more readable.
Run the file:
python join_data.py
Joined student and course data:
[(1, 'Alice', 'Math', 85, 4), (2, 'Bob', 'Science', 90, 3), (3, 'Charlie', 'History', 78, 3)]
The output shows a combined dataset with fields from both files. DuckDB has matched each student with their corresponding course based on the id field.
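One thing to keep in mind: the JOIN above is an inner join, so a student with no matching row in courses.csv would be dropped. A LEFT JOIN keeps every student and fills the missing course fields with NULL. A quick sketch, assuming the same two files:

import duckdb

# LEFT JOIN keeps every row from data.csv even when courses.csv has no match
result = duckdb.query("""
    SELECT s.name, c.course
    FROM 'data.csv' s
    LEFT JOIN 'courses.csv' c ON s.id = c.id
""").fetchall()
print(result)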
Aggregating data from multiple sources
Joins become even more powerful when combined with aggregation. Let's see how to calculate a weighted score for each student based on their course credits:
import duckdb
# Query data from both CSV files and join them
join_result = duckdb.query("""
...
""").fetchall()
print("Joined student and course data:")
print(join_result)
# Calculate weighted scores
weighted_result = duckdb.query("""
SELECT
s.name,
s.score,
c.credits,
s.score * c.credits AS weighted_score
FROM
'data.csv' s
JOIN
'courses.csv' c ON s.id = c.id
ORDER BY
weighted_score DESC
""").fetchall()
print("\nWeighted scores by course credits:")
print(weighted_result)
This example shows how you can join data from multiple sources and perform calculations using fields from both tables. The weighted score multiplies each student's score by the number of credits for their course.
Run the file again:
python join_data.py
Joined student and course data:
[(1, 'Alice', 'Math', 85, 4), (2, 'Bob', 'Science', 90, 3), (3, 'Charlie', 'History', 78, 3)]
Weighted scores by course credits:
[('Alice', 85, 4, 340), ('Bob', 90, 3, 270), ('Charlie', 78, 3, 234)]
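You can push the aggregation one step further and compute a single credit-weighted average across all students. A hedged sketch using the same two files:

import duckdb

# Credit-weighted average: sum of (score * credits) divided by total credits
result = duckdb.query("""
    SELECT SUM(s.score * c.credits) * 1.0 / SUM(c.credits) AS weighted_avg
    FROM 'data.csv' s
    JOIN 'courses.csv' c ON s.id = c.id
""").fetchall()
print(result)  # [(84.4,)]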
With DuckDB's ability to join data from multiple files, you can build sophisticated analyses without needing to import everything into a traditional database first. This makes DuckDB especially powerful for ad-hoc data exploration and analysis.
In the next section, we'll look at creating persistent tables and databases with DuckDB for more complex analytical workflows.
Creating persistent tables and databases
You've been working with external files and temporary in-memory databases—great for quick analysis. But when you need your data to stick around between sessions, it's time to create something more permanent. DuckDB makes this simple by allowing you to save databases to disk and create persistent tables.
To try this out, create a new file called persistent_db.py and add the following code:
import duckdb
# Connect to a persistent database file
# If the file doesn't exist, it will be created
db = duckdb.connect('school.duckdb')
# Create a students table
db.execute("""
CREATE TABLE IF NOT EXISTS students (
id INTEGER PRIMARY KEY,
name VARCHAR,
score INTEGER
)
""")
# Insert data into the table
db.execute("""
INSERT INTO students VALUES
(1, 'Alice', 85),
(2, 'Bob', 90),
(3, 'Charlie', 78)
ON CONFLICT (id) DO NOTHING
""")
# Query the table
result = db.execute("SELECT * FROM students").fetchall()
print("Students in persistent database:")
print(result)
# Don't forget to close the connection
db.close()
In this example, we're creating a connection to a file called school.duckdb. If this file doesn't exist, DuckDB will create it. The CREATE TABLE IF NOT EXISTS statement ensures we only create the table if it doesn't already exist, and the ON CONFLICT clause prevents duplicate entries if we run the script multiple times.
Run this script:
python persistent_db.py
Students in persistent database:
[(1, 'Alice', 85), (2, 'Bob', 90), (3, 'Charlie', 78)]
If you check your project directory, you'll see a new file called school.duckdb. This file contains your database with the students table.
Run the script again, and you'll notice the data doesn't get duplicated because of our ON CONFLICT clause.
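As a hedged aside, if you'd rather overwrite an existing row instead of skipping it, DuckDB also supports ON CONFLICT ... DO UPDATE (an upsert). Note that running this sketch changes Alice's score in school.duckdb:

import duckdb

db = duckdb.connect('school.duckdb')
# Upsert: insert Alice, or overwrite her score if a row with id 1 already exists
db.execute("""
    INSERT INTO students VALUES (1, 'Alice', 91)
    ON CONFLICT (id) DO UPDATE SET score = excluded.score
""")
print(db.execute("SELECT * FROM students WHERE id = 1").fetchall())
db.close()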
Let's create another script called update_db.py to modify our persistent data:
import duckdb
# Connect to the existing database
db = duckdb.connect('school.duckdb')
db.execute("INSERT INTO students VALUES (4, 'David', 95)")
db.execute("UPDATE students SET score = 88 WHERE name = 'Alice'")
result = db.execute("SELECT * FROM students ORDER BY id").fetchall()
print("Updated students table:")
print(result)
db.close()
Run this script:
python update_db.py
Updated students table:
[(1, 'Alice', 88), (2, 'Bob', 90), (3, 'Charlie', 78), (4, 'David', 95)]
The changes are now saved in the database file and will persist between sessions.
Creating persistent databases lets you build more structured and reliable data analysis workflows with DuckDB, all while keeping its fast, embedded nature.
With that in place, you’ll now explore how to integrate DuckDB with pandas and other popular Python data science libraries.
Integrating DuckDB with pandas
While DuckDB is powerful on its own, it works exceptionally well with pandas, Python's most popular data analysis library. This integration gives you the best of both worlds: DuckDB's fast SQL querying and pandas' rich data manipulation capabilities.
To get started, make sure you have pandas installed:
pip install pandas
Now, let's create a new file called duckdb_pandas.py to explore this integration:
import duckdb
import pandas as pd
# Create a simple pandas DataFrame
students_df = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'score': [88, 90, 78, 95]
})
print("Original pandas DataFrame:")
print(students_df)
In this example, you create a basic pandas DataFrame with some student data. It includes IDs, names, and scores.
Run this script to see the DataFrame:
python duckdb_pandas.py
Original pandas DataFrame:
id name score
0 1 Alice 88
1 2 Bob 90
2 3 Charlie 78
3 4 David 95
An interesting feature of DuckDB is its ability to run SQL queries directly on pandas DataFrames. This lets you take advantage of SQL’s flexibility while working with in-memory data:
import duckdb
import pandas as pd
# Create a simple pandas DataFrame
students_df = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'score': [88, 90, 78, 95]
})
print("Original pandas DataFrame:")
print(students_df)
# Query the DataFrame using SQL
result = duckdb.query("""
SELECT
name,
score,
CASE
WHEN score >= 90 THEN 'A'
WHEN score >= 80 THEN 'B'
ELSE 'C'
END AS grade
FROM students_df
WHERE score > 85
ORDER BY score DESC
""").fetchdf()
print("\nFiltered data with grades (as DataFrame):")
print(result)
In this code, you're using DuckDB to write a SQL query that runs directly on the students_df DataFrame. The query filters out students who scored 85 or less, assigns them letter grades based on their scores, and sorts the result from highest to lowest score. The .fetchdf() method returns the result as a new pandas DataFrame.
Run the script to see the result:
python duckdb_pandas.py
Original pandas DataFrame:
id name score
0 1 Alice 88
1 2 Bob 90
2 3 Charlie 78
3 4 David 95
Filtered data with grades (as DataFrame):
name score grade
0 David 95 A
1 Bob 90 A
2 Alice 88 B
Notice how we use the fetchdf() method instead of fetchall() to get the result as a pandas DataFrame rather than a list of tuples. DuckDB automatically recognizes DataFrame variables in your SQL queries, so you can reference them directly.
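If you prefer to be explicit, or the DataFrame isn't a local variable where you run the query, you can register it on a connection under a name of your choosing. A brief sketch:

import duckdb
import pandas as pd

students_df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [88, 90]})

conn = duckdb.connect()
# Expose the DataFrame to this connection as a view named 'students'
conn.register('students', students_df)
print(conn.execute("SELECT name FROM students WHERE score > 89").fetchall())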
DuckDB works smoothly with pandas, giving you a flexible environment for data analysis. You get the best of both worlds—SQL’s clarity for queries and pandas’ wide range of tools for data manipulation. This makes it easy to use the right tool at the right stage of your workflow.
Final thoughts
DuckDB makes it easy to run fast, SQL-based analysis right inside your Python projects—no server required. From querying files and DataFrames to building persistent databases, it fits smoothly into modern data workflows.
To go further, check out the official docs at duckdb.org.