
DuckDB for Python: A beginner's guide

Stanley Ulili
Updated on April 4, 2025

DuckDB is a fast, in-process analytical database built for modern data analysis. Unlike traditional client-server databases, DuckDB runs directly inside your Python application, offering high-performance querying without needing a separate server.

Thanks to its columnar storage and vectorized execution, it's especially powerful for analytical workloads—often outperforming tools like SQLite for complex queries.

In this guide, you'll set up a Python-based data analysis workflow using DuckDB. You'll explore key features, learn how to use them effectively, and tailor your setup for performance and flexibility.

Prerequisites

To follow along, make sure you have Python 3.13 or newer installed. You should also be comfortable with basic SQL and data analysis in Python.

Getting started with DuckDB

In this section, you’ll set up a Python environment for working with DuckDB and run your first queries.

Start by creating a new project directory and navigating into it:

 
mkdir duckdb-analysis && cd duckdb-analysis

Then, set up a virtual environment:

 
python3 -m venv venv

Activate the virtual environment. On macOS and Linux, run:

 
source venv/bin/activate

On Windows, run venv\Scripts\activate instead.

Now install the latest version of DuckDB. This guide works with DuckDB version 1.2.x, current at the time of writing:

 
pip install duckdb

Create a new analyzer.py file in the root of your project directory, and populate it with the following contents:

analyzer.py
import duckdb

# Create a connection to an in-memory database
conn = duckdb.connect()

# Execute a simple query and fetch the result
result = conn.execute("SELECT 'Hello, world!' AS message").fetchall()
print(result)

This script creates an in-memory DuckDB connection and runs a basic SQL query. The result is returned as a list of tuples, with each tuple representing a row from the query result.

Now, let’s use the same connection in a second file called main.py:

main.py
from analyzer import conn

# Execute a simple query
result = conn.execute("SELECT 42 AS answer").fetchall()
print(result)

In this example, main.py reuses the conn object from analyzer.py to run a query. Keep in mind that importing analyzer executes its top-level code, so the query in analyzer.py runs (and prints) before the one in main.py. Separating the connection logic keeps your code clean and modular.

Once you save the file, execute the program using the following command:

 
python main.py

You should observe the following output:

Output
[('Hello, world!',)]
[(42,)]

Each query returns a list of tuples—DuckDB’s default format when using .fetchall(). This makes it easy to work directly with small datasets in Python without needing additional libraries.
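
If you only need a single row, the result object also supports fetchone(), which returns the next row as a tuple (or None once the results are exhausted). A minimal sketch:

import duckdb

conn = duckdb.connect()

# fetchone() returns the next row as a tuple, or None when no rows remain
row = conn.execute("SELECT 'Hello, world!' AS message").fetchone()
print(row)  # ('Hello, world!',)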

Now that you can connect to DuckDB and run queries, you're ready to work with real data.

Working with external data sources in DuckDB

DuckDB excels at querying data from external sources without loading entire datasets into memory. Unlike traditional database systems that require importing data before working with it, DuckDB can directly query files in various formats.

This "zero-copy" approach offers significant advantages:

  • Reduced memory usage: DuckDB only loads the parts of the file your query needs.
  • No import/export overhead: You can query the data where it already lives.
  • Simplified workflows: You can run a single SQL query across multiple files or formats.

DuckDB supports many file formats, especially ones designed for analytics, like Parquet and CSV. Let’s look at how you can use these external sources.

To try this out, place a sample CSV file in your project directory. You can create one called data.csv with the following content:

data.csv
id,name,score
1,Alice,85
2,Bob,90
3,Charlie,78

Now, create a new file called query_csv.py and add this code:

query_csv.py
import duckdb

# Query data directly from a CSV file
result = duckdb.query("SELECT * FROM 'data.csv'").fetchall()
print(result)

This code runs a SQL query directly on the CSV file—no need to load the data into memory or convert it into another format.

You should see output like this:

Output
[(1, 'Alice', 85), (2, 'Bob', 90), (3, 'Charlie', 78)]
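
Under the hood, querying 'data.csv' by name uses DuckDB's CSV reader with automatic detection of headers and column types. When you want explicit control, you can call read_csv directly. A small sketch, assuming the same data.csv:

import duckdb

# Call the CSV reader explicitly instead of relying on auto-detection
result = duckdb.query("""
    SELECT * FROM read_csv('data.csv', header = true, delim = ',')
""").fetchall()
print(result)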

Querying Parquet files

DuckDB also works great with Parquet, a popular columnar format for analytics and big data workloads.

First, create a Parquet version of your existing CSV file to try it out.

Add the following to a new file called convert_to_parquet.py:

convert_to_parquet.py
import duckdb

# Convert CSV to Parquet
duckdb.query("COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT 'parquet')")

In this code, you use DuckDB to read the contents of data.csv and write them to a new file called data.parquet in Parquet format.

Run the script:

 
python convert_to_parquet.py

This creates a data.parquet file in your project directory.

Now create a new file called query_parquet.py and add the following code:

query_parquet.py
import duckdb

# Query data directly from a Parquet file
result = duckdb.query("SELECT * FROM 'data.parquet'").fetchall()
print(result)

Then run the script:

 
python query_parquet.py

The output will follow the same structure—a list of tuples:

Output
[(1, 'Alice', 85), (2, 'Bob', 90), (3, 'Charlie', 78)]

This ability to query external files directly is a big part of what makes DuckDB so powerful for data analysis. You don’t need a separate database server; you can start working with your data immediately.
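
DuckDB also accepts glob patterns in place of a single filename, so one query can scan many files at once. A sketch, assuming a hypothetical data/ directory of Parquet files that share a schema:

import duckdb

# Scan every Parquet file in a (hypothetical) data/ directory as one table
result = duckdb.query("SELECT COUNT(*) AS rows FROM 'data/*.parquet'").fetchall()
print(result)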

Exploring SQL features with external files

Now that you can query CSV and Parquet files directly, let's explore DuckDB's SQL capabilities. DuckDB is powerful because you can apply the full range of SQL operations to external files without importing them first. This means you can filter, sort, transform, and aggregate your data with just a few lines of code.

Let's build on our previous example with the data.csv file, starting with simple operations and progressively layering on more sophisticated SQL features to show what DuckDB can do.

Create a new file called explore_data.py to follow along.

Filtering rows

When analyzing data, you often need to focus on specific subsets that meet certain criteria. DuckDB makes this easy with SQL's WHERE clause.

explore_data.py
import duckdb

# Select only students with a score above 80
result = duckdb.query("SELECT * FROM 'data.csv' WHERE score > 80").fetchall()

print("Students with score > 80:")
print(result)

The WHERE clause acts like a filter, only returning rows that match your condition. You can use comparison operators like >, <, =, <=, >=, and <> (not equal), along with logical operators like AND, OR, and NOT to create more complex filters.

Run the file:

 
python explore_data.py
Output
Students with score > 80:
[(1, 'Alice', 85), (2, 'Bob', 90)]

The output shows only the students who scored above 80, excluding those who didn’t meet the condition.
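
Filters can also combine several conditions using the logical operators mentioned above. For example, this sketch uses AND to keep only scores in the 80–89 range:

import duckdb

# Combine conditions with AND to select a range of scores
range_result = duckdb.query(
    "SELECT * FROM 'data.csv' WHERE score >= 80 AND score < 90"
).fetchall()
print(range_result)  # [(1, 'Alice', 85)]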

Next, let’s sort the results to see the highest scores first.

Sorting results

Data analysis often requires organizing results in a specific order. SQL's ORDER BY clause lets you sort your data based on one or more columns.

Update your file to include this sorting code:

explore_data.py
import duckdb

# Select only students with a score above 80
result = duckdb.query("SELECT * FROM 'data.csv' WHERE score > 80").fetchall()

print("Students with score > 80:")
print(result)

# Sort students by score in descending order
sorted_result = duckdb.query("SELECT * FROM 'data.csv' ORDER BY score DESC").fetchall()
print("\nStudents sorted by score (highest first):")
print(sorted_result)

The ORDER BY clause sorts your results based on the specified column. By default, sorting is in ascending order (ASC), but you can use DESC for descending order as shown here. You can also sort by multiple columns by listing them separated by commas, like ORDER BY column1, column2 DESC.

Run the file to see the sorted results:

 
python explore_data.py
Output
# Previous output...

Students sorted by score (highest first):
[(2, 'Bob', 90), (1, 'Alice', 85), (3, 'Charlie', 78)]
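
As a quick sketch of multi-column sorting against the same file, you could order by score and break ties alphabetically by name:

import duckdb

# Sort by score (descending), then by name to break ties
tie_sorted = duckdb.query(
    "SELECT * FROM 'data.csv' ORDER BY score DESC, name ASC"
).fetchall()
print(tie_sorted)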

Adding calculated columns

SQL makes it easy to create new columns using expressions. This lets you transform your data and generate new insights directly in your query.

Update your file with the following code to add a calculated column:

explore_data.py
import duckdb

# Previous code...

# Add 10 bonus points to each score
bonus_result = duckdb.query("""
    SELECT
        name,
        score,
        score + 10 AS boosted_score
    FROM 'data.csv'
""").fetchall()
print("\nScores with 10 bonus points:")
print(bonus_result)

The AS keyword lets you name the result of an expression, creating a new column in your output. You can use arithmetic operators like +, -, *, and /, as well as many built-in functions. These calculated columns exist only in your query results and don't modify the original data file.

Rerun the file:

 
python explore_data.py
Output
# Previous output...

Scores with 10 bonus points:
[('Alice', 85, 95), ('Bob', 90, 100), ('Charlie', 78, 88)]

As you can see, the query adds 10 points to each student's score and returns it as a new column called boosted_score. The original score values stay unchanged—this transformation happens only in the query result.
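
Beyond arithmetic, the same pattern works with built-in functions. A short sketch using upper() and round():

import duckdb

# Apply built-in functions in calculated columns
fn_result = duckdb.query("""
    SELECT
        upper(name) AS name_upper,
        round(score * 1.1, 1) AS scaled_score
    FROM 'data.csv'
""").fetchall()
print(fn_result)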

Next, let’s take this further and assign letter grades based on each student's score.

Using CASE expressions

For more complex transformations, SQL provides the CASE expression. This works similarly to if-else statements in programming languages, allowing you to apply conditional logic in your queries.

Add this code to your file:

explore_data.py
import duckdb

# Previous code...

# Assign grades based on score
graded_result = duckdb.query("""
    SELECT
        name,
        score,
        CASE
            WHEN score >= 90 THEN 'A'
            WHEN score >= 80 THEN 'B'
            ELSE 'C'
        END AS grade
    FROM 'data.csv'
""").fetchall()
print("\nStudents with letter grades:")
print(graded_result)

The CASE expression evaluates conditions in order and returns the result for the first true condition. In this example, we convert numeric scores to letter grades based on thresholds. This is particularly useful for categorizing data or implementing business rules in your queries.

Run the file to see the letter grades:

 
python explore_data.py
Output
# Previous output...

Students with letter grades:
[('Alice', 85, 'B'), ('Bob', 90, 'A'), ('Charlie', 78, 'C')]

Now that you’ve learned how to transform and categorize your data, let’s look at how to summarize it using aggregate functions like AVG() and COUNT().

Aggregating data

SQL truly shines when it comes to summarizing data. Aggregate functions let you calculate statistics across multiple rows, giving you insights into your entire dataset or groups within it.

Add this aggregation code:

explore_data.py
import duckdb

# Previous code...

# Calculate summary statistics
stats_result = duckdb.query("""
    SELECT
        COUNT(*) AS student_count,
        AVG(score) AS average_score,
        MIN(score) AS lowest_score,
        MAX(score) AS highest_score
    FROM 'data.csv'
""").fetchall()
print("\nSummary statistics:")
print(stats_result)

# Count students by grade
grade_counts = duckdb.query("""
    SELECT
        CASE
            WHEN score >= 90 THEN 'A'
            WHEN score >= 80 THEN 'B'
            ELSE 'C'
        END AS grade,
        COUNT(*) AS count
    FROM 'data.csv'
    GROUP BY grade
    ORDER BY grade
""").fetchall()
print("\nCount of students by grade:")
print(grade_counts)

The first query uses aggregate functions like COUNT(), AVG(), MIN(), and MAX() to summarize the entire dataset. The second query combines CASE with GROUP BY to count students in each grade category. The GROUP BY clause divides your data into groups, and the aggregate functions then operate on each group separately.

Run the file again:

 
python explore_data.py
Output
# Previous output...

Summary statistics:
[(3, 84.33333333333333, 78, 90)]

Count of students by grade:
[('A', 1), ('B', 1), ('C', 1)]
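
To filter on the aggregated values themselves, SQL provides the HAVING clause, which runs after GROUP BY. A sketch that keeps only grade groups averaging above 80:

import duckdb

# HAVING filters groups after aggregation (WHERE filters rows before it)
having_result = duckdb.query("""
    SELECT
        CASE
            WHEN score >= 90 THEN 'A'
            WHEN score >= 80 THEN 'B'
            ELSE 'C'
        END AS grade,
        AVG(score) AS average_score
    FROM 'data.csv'
    GROUP BY grade
    HAVING AVG(score) > 80
    ORDER BY grade
""").fetchall()
print(having_result)  # the 'C' group (average 78) is filtered out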

Now that you can work with an external data source, you’ll move on to working with multiple data sources.

Working with multiple data sources

Up to this point, you’ve been working with just one data file. But in real-world analysis, it’s common to pull data from several sources. That’s where SQL joins come in—they let you combine related data easily. And with DuckDB, doing this is simple and efficient.

To try it out, let’s add another CSV file with course information that matches up with our student data. Create a new file named courses.csv and add the following content:

courses.csv
id,course,credits
1,Math,4
2,Science,3
3,History,3

Now, let's create a new file called join_data.py to demonstrate how to join these datasets:

join_data.py
import duckdb

# Query data from both CSV files and join them
join_result = duckdb.query("""
    SELECT
        s.id,
        s.name,
        c.course,
        s.score,
        c.credits
    FROM
        'data.csv' s
    JOIN
        'courses.csv' c ON s.id = c.id
""").fetchall()

print("Joined student and course data:")
print(join_result)

This query joins our student data with course information based on the matching id field. Note how we're using table aliases (s for students and c for courses) to make the query more readable.

Run the file:

 
python join_data.py
Output
Joined student and course data:
[(1, 'Alice', 'Math', 85, 4), (2, 'Bob', 'Science', 90, 3), (3, 'Charlie', 'History', 78, 3)]

The output shows a combined dataset with fields from both files. DuckDB has matched each student with their corresponding course based on the id field.
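
Note that this is an inner join: a student without a matching course id would be dropped from the result. If you want to keep unmatched rows, a LEFT JOIN does that. A brief sketch:

import duckdb

# LEFT JOIN keeps every student, filling NULL where no course matches
left_result = duckdb.query("""
    SELECT s.name, c.course
    FROM 'data.csv' s
    LEFT JOIN 'courses.csv' c ON s.id = c.id
""").fetchall()
print(left_result)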

Aggregating data from multiple sources

Joins become even more powerful when combined with aggregation. Let's see how to calculate a weighted score for each student based on their course credits:

join_data.py
import duckdb

# Query data from both CSV files and join them
join_result = duckdb.query("""
    ...
""").fetchall()

print("Joined student and course data:")
print(join_result)

# Calculate weighted scores
weighted_result = duckdb.query("""
    SELECT
        s.name,
        s.score,
        c.credits,
        s.score * c.credits AS weighted_score
    FROM
        'data.csv' s
    JOIN
        'courses.csv' c ON s.id = c.id
    ORDER BY
        weighted_score DESC
""").fetchall()
print("\nWeighted scores by course credits:")
print(weighted_result)

This example shows how you can join data from multiple sources and perform calculations using fields from both tables. The weighted score multiplies each student's score by the number of credits for their course.

Run the file again:

 
python join_data.py
Output
Joined student and course data:
[(1, 'Alice', 'Math', 85, 4), (2, 'Bob', 'Science', 90, 3), (3, 'Charlie', 'History', 78, 3)]

Weighted scores by course credits:
[('Alice', 85, 4, 340), ('Bob', 90, 3, 270), ('Charlie', 78, 3, 234)]

With DuckDB's ability to join data from multiple files, you can build sophisticated analyses without needing to import everything into a traditional database first. This makes DuckDB especially powerful for ad-hoc data exploration and analysis.

In the next section, we'll look at creating persistent tables and databases with DuckDB for more complex analytical workflows.

Creating persistent tables and databases

You've been working with external files and temporary in-memory databases—great for quick analysis. But when you need your data to stick around between sessions, it's time to create something more permanent. DuckDB makes this simple by allowing you to save databases to disk and create persistent tables.

To try this out, create a new file called persistent_db.py and add the following code:

persistent_db.py
import duckdb

# Connect to a persistent database file
# If the file doesn't exist, it will be created
db = duckdb.connect('school.duckdb')

# Create a students table
db.execute("""
    CREATE TABLE IF NOT EXISTS students (
        id INTEGER PRIMARY KEY,
        name VARCHAR,
        score INTEGER
    )
""")

# Insert data into the table
db.execute("""
    INSERT INTO students VALUES
        (1, 'Alice', 85),
        (2, 'Bob', 90),
        (3, 'Charlie', 78)
    ON CONFLICT (id) DO NOTHING
""")

# Query the table
result = db.execute("SELECT * FROM students").fetchall()
print("Students in persistent database:")
print(result)

# Don't forget to close the connection
db.close()

In this example, we're creating a connection to a file called school.duckdb. If this file doesn't exist, DuckDB will create it. The CREATE TABLE IF NOT EXISTS statement ensures we only create the table if it doesn't already exist, and the ON CONFLICT clause prevents duplicate entries if we run the script multiple times.

Run this script:

 
python persistent_db.py
Output
Students in persistent database:
[(1, 'Alice', 85), (2, 'Bob', 90), (3, 'Charlie', 78)]

If you check your project directory, you'll see a new file called school.duckdb. This file contains your database with the students table.

Run the script again, and you'll notice the data doesn't get duplicated because of our ON CONFLICT clause.

Let's create another script to modify our persistent data:

update_db.py
import duckdb

# Connect to the existing database
db = duckdb.connect('school.duckdb')
db.execute("INSERT INTO students VALUES (4, 'David', 95)")
db.execute("UPDATE students SET score = 88 WHERE name = 'Alice'")
result = db.execute("SELECT * FROM students ORDER BY id").fetchall()
print("Updated students table:")
print(result)

db.close()

Run this script:

 
python update_db.py
Output
Updated students table:
[(1, 'Alice', 88), (2, 'Bob', 90), (3, 'Charlie', 78), (4, 'David', 95)]

The changes are now saved in the database file and will persist between sessions.
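
You can confirm this from a fresh process. A small sketch (a hypothetical check_db.py) that reopens the file in read-only mode:

check_db.py
import duckdb

# Reopen the database file read-only and verify the data is still there
db = duckdb.connect('school.duckdb', read_only=True)
print(db.execute("SELECT COUNT(*) FROM students").fetchone())
db.close()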

Creating persistent databases lets you build more structured and reliable data analysis workflows with DuckDB, all while keeping its fast, embedded nature.

With that in place, you’ll now explore how to integrate DuckDB with pandas and other popular Python data science libraries.

Integrating DuckDB with pandas

While DuckDB is powerful on its own, it works exceptionally well with pandas, Python's most popular data analysis library. This integration gives you the best of both worlds: DuckDB's fast SQL querying and pandas' rich data manipulation capabilities.

To get started, make sure you have pandas installed:

 
pip install pandas

Now, let's create a new file called duckdb_pandas.py to explore this integration:

duckdb_pandas.py
import duckdb
import pandas as pd

# Create a simple pandas DataFrame
students_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'score': [88, 90, 78, 95]
})

print("Original pandas DataFrame:")
print(students_df)

In this example, you create a basic pandas DataFrame with some student data. It includes IDs, names, and scores.

Run this script to see the DataFrame:

 
python duckdb_pandas.py
Output
Original pandas DataFrame:
   id     name  score
0   1    Alice     88
1   2      Bob     90
2   3  Charlie     78
3   4    David     95

An interesting feature of DuckDB is its ability to run SQL queries directly on pandas DataFrames. This lets you take advantage of SQL’s flexibility while working with in-memory data:

duckdb_pandas.py
import duckdb
import pandas as pd

# Create a simple pandas DataFrame
students_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'score': [88, 90, 78, 95]
})

print("Original pandas DataFrame:")
print(students_df)

# Query the DataFrame using SQL
result = duckdb.query("""
    SELECT
        name,
        score,
        CASE
            WHEN score >= 90 THEN 'A'
            WHEN score >= 80 THEN 'B'
            ELSE 'C'
        END AS grade
    FROM students_df
    WHERE score > 85
    ORDER BY score DESC
""").fetchdf()
print("\nFiltered data with grades (as DataFrame):")
print(result)

In the new query, you run SQL directly on the students_df DataFrame. The query filters out students who scored 85 or less, assigns letter grades based on their scores, and sorts the result from highest to lowest score. The .fetchdf() method returns the result as a new pandas DataFrame.

Run the script to see the result:

 
python duckdb_pandas.py
Output
Original pandas DataFrame:
   id     name  score
0   1    Alice     88
1   2      Bob     90
2   3  Charlie     78
3   4    David     95

Filtered data with grades (as DataFrame):
    name  score grade
0  David     95     A
1    Bob     90     A
2  Alice     88     B

Notice how we use the fetchdf() method instead of fetchall() to get the result as a pandas DataFrame rather than a list of tuples. DuckDB automatically recognizes DataFrame variables in your SQL queries, so you can reference them directly.
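
This automatic lookup relies on the DataFrame's variable name in the surrounding Python scope. If you prefer an explicit, differently named binding, you can register the DataFrame on a connection. A minimal sketch:

import duckdb
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})

conn = duckdb.connect()
# Expose the DataFrame to SQL under an explicit view name
conn.register('people', df)
print(conn.execute("SELECT name FROM people").fetchall())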

DuckDB works smoothly with pandas, giving you a flexible environment for data analysis. You get the best of both worlds—SQL’s clarity for queries and pandas’ wide range of tools for data manipulation. This makes it easy to use the right tool at the right stage of your workflow.

Final thoughts

DuckDB makes it easy to run fast, SQL-based analysis right inside your Python projects—no server required. From querying files and DataFrames to building persistent databases, it fits smoothly into modern data workflows.

To go further, check out the official docs at duckdb.org.

Article by
Stanley Ulili
Stanley Ulili is a technical educator at Better Stack based in Malawi. He specializes in backend development and has freelanced for platforms like DigitalOcean, LogRocket, and AppSignal. Stanley is passionate about making complex topics accessible to developers.
Licensed under CC-BY-NC-SA

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
