# DuckDB for Python: A beginner's guide

[DuckDB](https://duckdb.org/) is a fast, in-process analytical database built for modern data analysis. Unlike traditional client-server databases, DuckDB runs directly inside your Python application, offering high-performance querying without the need for a separate server.

Thanks to its columnar storage and vectorized execution, it's especially powerful for analytical workloads—often outperforming tools like SQLite for complex queries.

<iframe width="100%" height="315" src="https://www.youtube.com/embed/1Q_62NigzN8?si=NeJyaOftkA0uFEM0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

In this guide, you'll set up a Python-based data analysis workflow using DuckDB. You'll explore key features, learn how to use them effectively, and tailor your setup for performance and flexibility.

[ad-logs]

## Prerequisites

To follow along, make sure you have [Python](https://www.python.org/downloads/) 3.13 or newer installed. You should also be comfortable with basic SQL and data analysis in Python.

## Getting started with DuckDB

In this section, you’ll set up a Python environment for working with DuckDB and run your first queries.

Start by creating a new project directory and navigating into it:

```command
mkdir duckdb-analysis && cd duckdb-analysis
```

Then, set up a virtual environment:

```command
python3 -m venv venv
```

Activate the virtual environment. On Linux and macOS, run:

```command
source venv/bin/activate
```

On Windows, run `venv\Scripts\activate` instead.

Now install the latest version of [DuckDB](https://pypi.org/project/duckdb/). This guide works with DuckDB version 1.2.x, current at the time of writing:

```command
pip install duckdb
```

Create a new `analyzer.py` file in the root of your project directory, and populate it with the following contents:

```python
[label analyzer.py]
import duckdb

# Create a connection to an in-memory database
conn = duckdb.connect()

# Execute a simple query and fetch the result
result = conn.execute("SELECT 'Hello, world!' AS message").fetchall()
print(result)
```

This script creates an in-memory DuckDB connection and runs a basic SQL query. The result is returned as a list of tuples, with each tuple representing a row from the query result.

Now, let’s use the same connection in a second file called `main.py`:

```python
[label main.py]
from analyzer import conn

# Execute a simple query
result = conn.execute("SELECT 42 AS answer").fetchall()
print(result)
```

In this example, `main.py` reuses the `conn` object from `analyzer.py` to run a query. Keep in mind that importing `analyzer` also executes its top-level code, which is why the `'Hello, world!'` query runs first. Separating the connection logic keeps your code clean and modular.

Once you save the file, execute the program using the following command:

```command
python main.py
```

You should observe the following output:

```text
[output]
[('Hello, world!',)]
[(42,)]
```

Each query returns a list of tuples—DuckDB’s default format when using `.fetchall()`. This makes it easy to work directly with small datasets in Python without needing additional libraries.

Now that you can connect to DuckDB and run queries, you're ready to work with real data.

## Working with external data sources in DuckDB

DuckDB excels at querying data from external sources without loading entire datasets into memory. Unlike traditional database systems that require importing data before working with it, DuckDB can directly query files in various formats.

This "zero-copy" approach offers significant advantages:

- Reduced memory usage: DuckDB only loads the parts of the file your query needs.
- No import/export overhead: You can query the data where it already lives.
- Simplified workflows: You can run a single SQL query across multiple files or formats.

DuckDB supports many file formats, especially ones designed for analytics, like Parquet and CSV. Let’s look at how you can use these external sources.

To try this out, place a sample CSV file in your project directory. You can create one called `data.csv` with the following content:

```text
[label data.csv]
id,name,score
1,Alice,85
2,Bob,90
3,Charlie,78
```

Now, create a new file called `query_csv.py` and add this code:

```python
[label query_csv.py]
import duckdb

# Query data directly from a CSV file
result = duckdb.query("SELECT * FROM 'data.csv'").fetchall()
print(result)
```

This code runs a SQL query directly on the CSV file—no need to load the data into memory or convert it into another format.

You should see output like this:

```text
[output]
[(1, 'Alice', 85), (2, 'Bob', 90), (3, 'Charlie', 78)]
```

### Querying Parquet files

DuckDB also works great with Parquet, a columnar file format widely used in big data and analytics.

First, create a Parquet version of your existing CSV file to try it out.

Add the following to a new file called `convert_to_parquet.py`:

```python
[label convert_to_parquet.py]
import duckdb

# Convert CSV to Parquet
duckdb.query("COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT 'parquet')")
```

In this code, you use DuckDB to read the contents of `data.csv` and write them to a new file called `data.parquet` in Parquet format.

Run the script:

```command
python convert_to_parquet.py
```

This creates a `data.parquet` file in your project directory.

Now create a new file called `query_parquet.py` and add the following code:

```python
[label query_parquet.py]
import duckdb

# Query data directly from a Parquet file
result = duckdb.query("SELECT * FROM 'data.parquet'").fetchall()
print(result)
```

Then run the script:

```command
python query_parquet.py
```

The output will follow the same structure—a list of tuples:

```text
[output]
[(1, 'Alice', 85), (2, 'Bob', 90), (3, 'Charlie', 78)]
```

This ability to query external files directly is what makes DuckDB so powerful for data analysis. You don’t need a separate database server; you can start working with your data immediately.

## Exploring SQL features with external files

Now that you can query CSV and Parquet files directly, let's explore DuckDB's SQL capabilities. DuckDB is powerful because you can apply the full range of SQL operations to external files without importing them first. This means you can filter, sort, transform, and aggregate your data with just a few lines of code.

Let's build on the previous example with the `data.csv` file, starting with practical queries and progressively adding more sophisticated SQL features to show what DuckDB can do.

Create a new file called `explore_data.py` and start with some practical operations:

### Filtering rows

When analyzing data, you often need to focus on specific subsets that meet certain criteria. DuckDB makes this easy with SQL's `WHERE` clause.

```python
[label explore_data.py]
import duckdb

# Select only students with a score above 80
result = duckdb.query("SELECT * FROM 'data.csv' WHERE score > 80").fetchall()

print("Students with score > 80:")
print(result)
```

The `WHERE` clause acts like a filter, only returning rows that match your condition. You can use comparison operators like `>`, `<`, `=`, `<=`, `>=`, and `<>` (not equal), along with logical operators like `AND`, `OR`, and `NOT` to create more complex filters.

Run the file:

```command
python explore_data.py
```

```text
[output]
Students with score > 80:
[(1, 'Alice', 85), (2, 'Bob', 90)]
```

The output shows only the students who scored above 80, excluding those who didn’t meet the condition.

Next, let’s sort the results to see the highest scores first.

### Sorting results

Data analysis often requires organizing results in a specific order. SQL's `ORDER BY` clause lets you sort your data based on one or more columns.

Update your file to include this sorting code:

```python
[label explore_data.py]
import duckdb

# Select only students with a score above 80
result = duckdb.query("SELECT * FROM 'data.csv' WHERE score > 80").fetchall()

print("Students with score > 80:")
print(result)

[highlight]
# Sort students by score in descending order
sorted_result = duckdb.query("SELECT * FROM 'data.csv' ORDER BY score DESC").fetchall()
print("\nStudents sorted by score (highest first):")
print(sorted_result)
[/highlight]
```

The `ORDER BY` clause sorts your results based on the specified column. By default, sorting is in ascending order (`ASC`), but you can use `DESC` for descending order as shown here. You can also sort by multiple columns by listing them separated by commas, like `ORDER BY column1, column2 DESC`.

Run the file to see the sorted results:

```command
python explore_data.py
```

```text
[output]
# Previous output...

Students sorted by score (highest first):
[(2, 'Bob', 90), (1, 'Alice', 85), (3, 'Charlie', 78)]
```

### Adding calculated columns

SQL makes it easy to create new columns using expressions. This lets you transform your data and generate new insights directly in your query.

Update your file with the following code to add a calculated column:

```python
[label explore_data.py]
import duckdb

...

# Add 10 bonus points to each score
bonus_result = duckdb.query("""
    SELECT
        name,
        score,
        score + 10 AS boosted_score
    FROM 'data.csv'
""").fetchall()
print("\nScores with 10 bonus points:")
print(bonus_result)
```

The `AS` keyword lets you name the result of an expression, creating a new column in your output. You can use arithmetic operators like `+`, `-`, `*`, and `/`, as well as many built-in functions. These calculated columns exist only in your query results and don't modify the original data file.

Rerun the file:

```command
python explore_data.py
```

```text
[output]
# Previous output...

Scores with 10 bonus points:
[('Alice', 85, 95), ('Bob', 90, 100), ('Charlie', 78, 88)]
```

As you can see, the query adds 10 points to each student's score and returns it as a new column called `boosted_score`. The original `score` values stay unchanged—this transformation happens only in the query result.

Next, let’s take this further and assign letter grades based on each student's score.

### Using CASE expressions

For more complex transformations, SQL provides the `CASE` expression. This works similarly to if-else statements in programming languages, allowing you to apply conditional logic in your queries.

Add this code to your file:

```python
[label explore_data.py]
import duckdb

# Previous code...

# Assign grades based on score
graded_result = duckdb.query("""
    SELECT
        name,
        score,
        CASE
            WHEN score >= 90 THEN 'A'
            WHEN score >= 80 THEN 'B'
            ELSE 'C'
        END AS grade
    FROM 'data.csv'
""").fetchall()
print("\nStudents with letter grades:")
print(graded_result)
```

The `CASE` expression evaluates conditions in order and returns the result for the first true condition. In this example, we convert numeric scores to letter grades based on thresholds. This is particularly useful for categorizing data or implementing business rules in your queries.

Run the file to see the letter grades:

```command
python explore_data.py
```

```text
[output]
# Previous output...

Students with letter grades:
[('Alice', 85, 'B'), ('Bob', 90, 'A'), ('Charlie', 78, 'C')]
```

Now that you’ve learned how to transform and categorize your data, let’s look at how to summarize it using aggregate functions like `AVG()` and `COUNT()`.

### Aggregating data

SQL truly shines when it comes to summarizing data. Aggregate functions let you calculate statistics across multiple rows, giving you insights into your entire dataset or groups within it.

Add this aggregation code:

```python
[label explore_data.py]
import duckdb

# Previous code...

# Calculate summary statistics
stats_result = duckdb.query("""
    SELECT
        COUNT(*) AS student_count,
        AVG(score) AS average_score,
        MIN(score) AS lowest_score,
        MAX(score) AS highest_score
    FROM 'data.csv'
""").fetchall()
print("\nSummary statistics:")
print(stats_result)

# Count students by grade
grade_counts = duckdb.query("""
    SELECT
        CASE
            WHEN score >= 90 THEN 'A'
            WHEN score >= 80 THEN 'B'
            ELSE 'C'
        END AS grade,
        COUNT(*) AS count
    FROM 'data.csv'
    GROUP BY grade
    ORDER BY grade
""").fetchall()
print("\nCount of students by grade:")
print(grade_counts)
```

The first query uses aggregate functions like `COUNT()`, `AVG()`, `MIN()`, and `MAX()` to summarize the entire dataset. The second query combines `CASE` with `GROUP BY` to count students in each grade category. The `GROUP BY` clause divides your data into groups, and the aggregate functions then operate on each group separately.

Run the file again:

```command
python explore_data.py
```

```text
[output]
# Previous output...

Summary statistics:
[(3, 84.33333333333333, 78, 90)]

Count of students by grade:
[('A', 1), ('B', 1), ('C', 1)]
```

Now that you can work with a single external data source, let’s move on to combining data from several of them.

## Working with multiple data sources

Up to this point, you’ve been working with just one data file. But in real-world analysis, it’s common to pull data from several sources. That’s where SQL joins come in—they let you combine related data easily. And with DuckDB, doing this is simple and efficient.

To try it out, let’s add another CSV file with course information that matches up with our student data. Create a new file named `courses.csv` and add the following content:

```text
[label courses.csv]
id,course,credits
1,Math,4
2,Science,3
3,History,3
```

Now, let's create a new file called `join_data.py` to demonstrate how to join these datasets:

```python
[label join_data.py]
import duckdb

# Query data from both CSV files and join them
join_result = duckdb.query("""
    SELECT
        s.id,
        s.name,
        c.course,
        s.score,
        c.credits
    FROM
        'data.csv' s
    JOIN
        'courses.csv' c ON s.id = c.id
""").fetchall()

print("Joined student and course data:")
print(join_result)
```

This query joins our student data with course information based on the matching `id` field. Note how we're using table aliases (`s` for students and `c` for courses) to make the query more readable.

Run the file:

```command
python join_data.py
```

```text
[output]
Joined student and course data:
[(1, 'Alice', 'Math', 85, 4), (2, 'Bob', 'Science', 90, 3), (3, 'Charlie', 'History', 78, 3)]
```

The output shows a combined dataset with fields from both files. DuckDB has matched each student with their corresponding course based on the `id` field.

### Aggregating data from multiple sources

Joins become even more powerful when combined with aggregation. Let's see how to calculate a weighted score for each student based on their course credits:

```python
[label join_data.py]
import duckdb

# Query data from both CSV files and join them
join_result = duckdb.query("""
    ...
""").fetchall()

print("Joined student and course data:")
print(join_result)

[highlight]
# Calculate weighted scores
weighted_result = duckdb.query("""
    SELECT
        s.name,
        s.score,
        c.credits,
        s.score * c.credits AS weighted_score
    FROM
        'data.csv' s
    JOIN
        'courses.csv' c ON s.id = c.id
    ORDER BY
        weighted_score DESC
""").fetchall()

print("\nWeighted scores by course credits:")
print(weighted_result)
[/highlight]
```

This example shows how you can join data from multiple sources and perform calculations using fields from both tables. The weighted score multiplies each student's score by the number of credits for their course.

Run the file again:

```command
python join_data.py
```

```text
[output]
Joined student and course data:
[(1, 'Alice', 'Math', 85, 4), (2, 'Bob', 'Science', 90, 3), (3, 'Charlie', 'History', 78, 3)]

Weighted scores by course credits:
[('Alice', 85, 4, 340), ('Bob', 90, 3, 270), ('Charlie', 78, 3, 234)]
```

With DuckDB's ability to join data from multiple files, you can build sophisticated analyses without needing to import everything into a traditional database first. This makes DuckDB especially powerful for ad-hoc data exploration and analysis.

In the next section, we'll look at creating persistent tables and databases with DuckDB for more complex analytical workflows.

## Creating persistent tables and databases

You've been working with external files and temporary in-memory databases—great for quick analysis. But when you need your data to stick around between sessions, it's time to create something more permanent. DuckDB makes this simple by allowing you to save databases to disk and create persistent tables.

To try this out, create a new file called `persistent_db.py` and add the following code:

```python
[label persistent_db.py]
import duckdb

# Connect to a persistent database file
# If the file doesn't exist, it will be created
db = duckdb.connect('school.duckdb')

# Create a students table
db.execute("""
    CREATE TABLE IF NOT EXISTS students (
        id INTEGER PRIMARY KEY,
        name VARCHAR,
        score INTEGER
    )
""")

# Insert data into the table
db.execute("""
    INSERT INTO students VALUES
        (1, 'Alice', 85),
        (2, 'Bob', 90),
        (3, 'Charlie', 78)
    ON CONFLICT (id) DO NOTHING
""")

# Query the table
result = db.execute("SELECT * FROM students").fetchall()
print("Students in persistent database:")
print(result)

# Don't forget to close the connection
db.close()
```

In this example, we're creating a connection to a file called `school.duckdb`. If this file doesn't exist, DuckDB will create it. The `CREATE TABLE IF NOT EXISTS` statement ensures we only create the table if it doesn't already exist, and the `ON CONFLICT` clause prevents duplicate entries if we run the script multiple times.

Run this script:

```command
python persistent_db.py
```

```text
[output]
Students in persistent database:
[(1, 'Alice', 85), (2, 'Bob', 90), (3, 'Charlie', 78)]
```

If you check your project directory, you'll see a new file called `school.duckdb`. This file contains your database with the students table.

Run the script again, and you'll notice the data doesn't get duplicated because of our `ON CONFLICT` clause.

Let's create another script to modify our persistent data:

```python
[label update_db.py]
import duckdb

# Connect to the existing database
db = duckdb.connect('school.duckdb')

# Add a new student (skip the insert if the id already exists)
db.execute("INSERT INTO students VALUES (4, 'David', 95) ON CONFLICT (id) DO NOTHING")

# Correct Alice's score
db.execute("UPDATE students SET score = 88 WHERE name = 'Alice'")

result = db.execute("SELECT * FROM students ORDER BY id").fetchall()
print("Updated students table:")
print(result)

db.close()
```

Run this script:

```command
python update_db.py
```

```text
[output]
Updated students table:
[(1, 'Alice', 88), (2, 'Bob', 90), (3, 'Charlie', 78), (4, 'David', 95)]
```

The changes are now saved in the database file and will persist between sessions.

Creating persistent databases lets you build more structured and reliable data analysis workflows with DuckDB, all while keeping its fast, embedded nature.

With that in place, you’ll now explore how to integrate DuckDB with pandas and other popular Python data science libraries.

## Integrating DuckDB with pandas

While DuckDB is powerful on its own, it works exceptionally well with pandas, Python's most popular data analysis library. This integration gives you the best of both worlds: DuckDB's fast SQL querying and pandas' rich data manipulation capabilities.

To get started, make sure you have pandas installed:

```command
pip install pandas
```

Now, let's create a new file called `duckdb_pandas.py` to explore this integration:

```python
[label duckdb_pandas.py]
import duckdb
import pandas as pd

# Create a simple pandas DataFrame
students_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'score': [88, 90, 78, 95]
})

print("Original pandas DataFrame:")
print(students_df)
```

In this example, you create a basic pandas DataFrame with some student data. It includes IDs, names, and scores.

Run this script to see the DataFrame:

```command
python duckdb_pandas.py
```

```text
[output]
Original pandas DataFrame:
   id     name  score
0   1    Alice     88
1   2      Bob     90
2   3  Charlie     78
3   4    David     95
```

An interesting feature of DuckDB is its ability to run SQL queries directly on pandas DataFrames. This lets you take advantage of SQL’s flexibility while working with in-memory data:

```python
[label duckdb_pandas.py]
import duckdb
import pandas as pd

# Create a simple pandas DataFrame
students_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'score': [88, 90, 78, 95]
})

print("Original pandas DataFrame:")
print(students_df)

[highlight]
# Query the DataFrame using SQL
result = duckdb.query("""
    SELECT
        name,
        score,
        CASE
            WHEN score >= 90 THEN 'A'
            WHEN score >= 80 THEN 'B'
            ELSE 'C'
        END AS grade
    FROM students_df
    WHERE score > 85
    ORDER BY score DESC
""").fetchdf()

print("\nFiltered data with grades (as DataFrame):")
print(result)
[/highlight]
```

In the highlighted code, you're using DuckDB to write a SQL query that runs directly on the `students_df` DataFrame. The query filters out students who scored 85 or less, assigns them letter grades based on their scores, and sorts the result from highest to lowest score. The `.fetchdf()` method returns the result as a new pandas DataFrame.

Run the script to see the result:

```command
python duckdb_pandas.py
```

```text
[output]
Original pandas DataFrame:
   id     name  score
0   1    Alice     88
1   2      Bob     90
2   3  Charlie     78
3   4    David     95

Filtered data with grades (as DataFrame):
    name  score grade
0  David     95     A
1    Bob     90     A
2  Alice     88     B
```

Notice how we use the `fetchdf()` method instead of `fetchall()` to get the result as a pandas DataFrame rather than a list of tuples. DuckDB automatically recognizes DataFrame variables in your SQL queries, so you can reference them directly.

DuckDB works smoothly with pandas, giving you a flexible environment for data analysis. You get the best of both worlds—SQL’s clarity for queries and pandas’ wide range of tools for data manipulation. This makes it easy to use the right tool at the right stage of your workflow.

## Final thoughts

DuckDB makes it easy to run fast, SQL-based analysis right inside your Python projects—no server required. From querying files and DataFrames to building persistent databases, it fits smoothly into modern data workflows.

To go further, check out the official docs at [duckdb.org](https://duckdb.org/).
