CSV (Comma-Separated Values) files are the backbone of data exchange. Their simple, tabular structure makes them perfect for data analysis, configuration storage, and sharing information between systems.
Python provides efficient tools for working with CSV data through its standard library and specialized packages.
This guide will show you how to handle CSV files in Python, from basic reading to advanced data manipulation.
Prerequisites
Make sure you have Python 3.7 or later installed. You should know Python basics like lists, dictionaries, and file operations to get the most from this guide.
Getting started with CSV in Python
Python's csv module has everything you need to process CSV files. It's part of the standard library, so you don't need to install any additional software.
As shown in the diagram below, working with CSV files in Python follows a simple flow:
This workflow makes Python an excellent choice for CSV processing, whether you're doing simple data transformations or complex analysis.
To begin, create a project directory:
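```shell
# any name works; csv-tutorial is used here as an example
mkdir csv-tutorial
cd csv-tutorial
```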
Next, set up a virtual environment:
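```shell
python3 -m venv venv
```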
Activate the virtual environment:
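```shell
# On macOS/Linux:
source venv/bin/activate

# On Windows:
# venv\Scripts\activate
```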
Now, create a file named app.py that you'll use throughout this tutorial.
Reading CSV files in Python
Reading data from CSV files is often your first step in data analysis. The csv module makes this straightforward.
First, create a sample CSV file named employees.csv with this data:
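```csv
id,name,department,salary
1,Alice Johnson,Engineering,85000
2,Bob Smith,Marketing,62000
3,Carol Williams,Engineering,92000
4,David Brown,Sales,58000
```

These rows are placeholders; the examples that follow only rely on the four column names.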
Now add this code to your app.py file:
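Something like the following works (the snippet recreates the sample file if it's missing, so it also runs on its own):

```python
import csv
from pathlib import Path

# Recreate the sample data from the previous step if the file is
# missing, so this snippet is self-contained
if not Path("employees.csv").exists():
    Path("employees.csv").write_text(
        "id,name,department,salary\n"
        "1,Alice Johnson,Engineering,85000\n"
        "2,Bob Smith,Marketing,62000\n"
    )

# newline='' lets the csv module manage line endings itself
with open("employees.csv", newline="") as file:
    csv_reader = csv.reader(file)
    header = next(csv_reader)  # pull the header row off first
    print(f"Columns: {header}")
    for row in csv_reader:
        print(row)  # each row is a list of strings
```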
The csv.reader() function gives you an iterator that returns each row as a list of strings. The next(csv_reader) line extracts the header row before processing the data rows.
Run your script:
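```shell
python app.py
```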
Notice the newline='' parameter when opening the file—this prevents issues with different line endings across operating systems.
Working with CSV data as dictionaries
While lists work for simple files, accessing data by column name instead of index position makes your code more readable. The csv.DictReader class does this by mapping each row to a dictionary using the header row as keys.
Update your app.py with this code:
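A minimal sketch (again recreating the sample file if needed so the snippet runs standalone):

```python
import csv
from pathlib import Path

# Recreate the sample file if it's missing
if not Path("employees.csv").exists():
    Path("employees.csv").write_text(
        "id,name,department,salary\n"
        "1,Alice Johnson,Engineering,85000\n"
        "2,Bob Smith,Marketing,62000\n"
    )

with open("employees.csv", newline="") as file:
    reader = csv.DictReader(file)  # the header row becomes the dict keys
    for row in reader:
        # access fields by column name instead of numeric index
        print(f"{row['name']} works in {row['department']}")
```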
Using DictReader gives you several advantages:
- You access data by meaningful column names instead of numeric indices
- Your code becomes more resilient to changes in column order
- Your logic reflects the actual structure of your data
- Extracting specific columns becomes more intuitive
Run the file again:
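```shell
python app.py
```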
This name-based access makes your code more readable and less prone to errors when column positions change.
Writing data to CSV files
Creating CSV files is just as important as reading them. Python's csv module gives you two approaches: csv.writer for list-based data and csv.DictWriter for dictionary-based data.
Update your app.py with this code:
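One way to write this (the output filename employees_output.csv is this example's choice, not a requirement):

```python
import csv

# Sample dataset: each employee is a dictionary
employees = [
    {"id": 1, "name": "Alice Johnson", "department": "Engineering", "salary": 85000},
    {"id": 2, "name": "Bob Smith", "department": "Marketing", "salary": 62000},
    {"id": 3, "name": "Carol Williams", "department": "Engineering", "salary": 92000},
]

with open("employees_output.csv", "w", newline="") as file:
    writer = csv.writer(file)
    # write the header row first
    writer.writerow(["id", "name", "department", "salary"])
    # then write each employee as a list of values
    for emp in employees:
        writer.writerow([emp["id"], emp["name"], emp["department"], emp["salary"]])
```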
In this code, you create a sample dataset of employees, each represented as a dictionary containing their id, name, department, and salary.
The code uses csv.writer to write this data to a CSV file in list format. You first write a header row and then loop through the employee data to write each row as a list of values.
Run the script:
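```shell
python app.py
```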
Check the output file:
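```shell
# assuming the script wrote to employees_output.csv
cat employees_output.csv
```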
Alternatively, you can use csv.DictWriter for a more dictionary-oriented approach:
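A sketch of the same data written with DictWriter (the filename employees_dict.csv is again just this example's choice):

```python
import csv

employees = [
    {"id": 1, "name": "Alice Johnson", "department": "Engineering", "salary": 85000},
    {"id": 2, "name": "Bob Smith", "department": "Marketing", "salary": 62000},
]

# fieldnames controls both the header and the column order
fieldnames = ["id", "name", "department", "salary"]

with open("employees_dict.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()         # header row generated from fieldnames
    writer.writerows(employees)  # dictionaries map to columns automatically
```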
The key benefits of DictWriter include:
- It generates headers automatically with writeheader()
- It maintains field order with the fieldnames parameter
- You can easily add calculated fields
- You can write dictionaries directly without manual ordering
Handling CSV dialects and formats
CSV isn't a standardized format. Different systems use variations in delimiters, quoting, and escape characters. Python's csv module handles these through "dialects."
Try these format variations:
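A sketch using sample product data (the filenames and rows are this example's assumptions; note the description field that contains a comma):

```python
import csv

products = [
    ["id", "product", "description"],
    [1, "Keyboard", "Cherry MX switches, tactile feel"],
    [2, "Mouse", "Wireless optical mouse"],
]

# Standard comma-delimited file
with open("products.csv", "w", newline="") as f:
    csv.writer(f).writerows(products)

# Tab-delimited file: only the delimiter parameter changes
with open("products.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(products)
```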
This code creates two different CSV files from the same data. The first uses the standard comma delimiter, while the second uses tabs instead. Notice how you only need to change the delimiter parameter to switch formats.
Run the script:
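```shell
python app.py
```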
You can compare the two formats:
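```shell
# assuming the files were written as products.csv and products.tsv
cat products.csv
cat products.tsv
```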
Notice how quoting differs between the two files. The comma-separated format has to wrap the Cherry MX description in quotes because that field contains a comma, while the tab-separated format can write the same field without any quoting.
For other common formats, modify dialect parameters as needed:
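For example (the filenames and the dialect name single_quoted are this guide's choices):

```python
import csv

data = [
    ["id", "product", "price"],
    [1, "Keyboard", 79.99],
    [2, "Mouse", 24.50],
]

# Semicolon-separated values, common in European locales
with open("products_semicolon.csv", "w", newline="") as f:
    csv.writer(f, delimiter=";").writerows(data)

# Custom dialect: single quotes around every non-numeric field
csv.register_dialect(
    "single_quoted",
    delimiter=",",
    quotechar="'",
    quoting=csv.QUOTE_NONNUMERIC,
)
with open("products_quoted.csv", "w", newline="") as f:
    csv.writer(f, dialect="single_quoted").writerows(data)
```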
The first part of the code shows how to create a semicolon-separated file, common in countries that use commas for decimal points.
The second creates a custom dialect that uses single quotes and applies them to all non-numeric fields.
Key dialect parameters include:
| Parameter | Description | Common Values |
|---|---|---|
| delimiter | Field separator | , (default), ;, \t (tab) |
| quotechar | Quoting character | " (default), ' |
| quoting | Quoting style | csv.QUOTE_MINIMAL, csv.QUOTE_ALL |
| escapechar | Character for escaping | \ |
When working with unknown CSV formats, use csv.Sniffer to detect the dialect automatically:
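A sketch (the snippet writes a semicolon-separated mystery.csv first so it has an "unknown" file to detect):

```python
import csv

# Write a file whose format we'll pretend is unknown
with open("mystery.csv", "w", newline="") as f:
    f.write(
        "id;name;department\n"
        "1;Alice Johnson;Engineering\n"
        "2;Bob Smith;Marketing\n"
    )

with open("mystery.csv", newline="") as f:
    sample = f.read(1024)                # a small sample is enough
    dialect = csv.Sniffer().sniff(sample)
    f.seek(0)                            # rewind before the real read
    for row in csv.reader(f, dialect):
        print(row)
```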
This code reads a small sample of the file, uses csv.Sniffer().sniff() to analyze its format, then creates a reader with the detected dialect. This is especially useful when processing files from various sources where you don't know the exact format in advance.
These parameters help you handle almost any CSV variation you encounter.
Validating and cleaning CSV data
Real-world CSV files often contain inconsistencies, errors, and missing values. While you could write validation code from scratch, Python provides specialized tools that simplify this process significantly.
Pandas is a powerful data analysis library that excels at handling tabular data like CSV files. Before proceeding, you'll need to install it:
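```shell
pip install pandas
```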
Now, create a new file named validation.py that uses pandas to validate a team.csv file:
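A sketch along these lines (the snippet writes a small team.csv with deliberate problems if one doesn't exist, so it runs standalone; the columns mirror the earlier employee data):

```python
import pandas as pd
from pathlib import Path

# Create a sample team.csv with deliberate problems (a missing
# department and a missing salary) if the file doesn't already exist
if not Path("team.csv").exists():
    Path("team.csv").write_text(
        "id,name,department,salary\n"
        "1,Alice Johnson,Engineering,85000\n"
        "2,Bob Smith,,62000\n"
        "3,Carol Williams,Engineering,\n"
    )

# dtype checks types while loading: a non-numeric id or salary
# would raise an error right here
df = pd.read_csv("team.csv", dtype={"id": int, "salary": float})

# Count missing values in each column
print("Missing values per column:")
print(df.isnull().sum())

# Show every row that has at least one missing field
print("\nRows with missing data:")
print(df[df.isnull().any(axis=1)])
```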
This code demonstrates why pandas is the preferred tool for CSV validation.
Its DataFrame structure automatically handles column types, provides powerful filtering for validation rules, and makes it easy to identify problematic data.
The dtype parameter validates data types while loading the file, immediately catching conversion errors.
Run the script:
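```shell
python validation.py
```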
Pandas offers many additional validation capabilities:
You can detect duplicate values in columns that should be unique, like ID fields. The duplicated() method flags repeat values while sum() counts how many duplicates exist.
For categorical data, you'll often need to verify values against an allowed list. Here, the isin() method checks if each department appears in the valid options list, while the tilde (~) operator finds values that don't match your approved categories.
Statistical validation helps identify questionable values that technically pass type checking. This approach identifies salary outliers using the standard "two-sigma rule" - values more than two standard deviations from the mean are flagged for review.
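The three checks above can be sketched together like this, using made-up data in place of a loaded CSV (the values, including the duplicate ID, the unknown "Gardening" department, and the outsized salary, are deliberately planted):

```python
import pandas as pd

# Hypothetical data standing in for a DataFrame loaded from team.csv
df = pd.DataFrame({
    "id": [1, 2, 3, 3, 4, 5, 6],
    "department": ["Engineering", "Marketing", "Sales", "Sales",
                   "Gardening", "Engineering", "Marketing"],
    "salary": [58000, 60000, 62000, 59000, 61000, 60000, 250000],
})

# 1. Duplicate IDs: duplicated() flags repeats, sum() counts them
print(f"Duplicate IDs: {df['id'].duplicated().sum()}")

# 2. Categorical check: ~isin() finds departments outside the allowed list
valid_departments = ["Engineering", "Marketing", "Sales"]
invalid = df[~df["department"].isin(valid_departments)]
print("Rows with unknown departments:")
print(invalid)

# 3. Two-sigma rule: flag salaries more than two standard deviations
#    from the mean for manual review
mean, std = df["salary"].mean(), df["salary"].std()
outliers = df[(df["salary"] - mean).abs() > 2 * std]
print("Salary outliers:")
print(outliers)
```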
Final thoughts
In this guide, you learned how to handle CSV files in Python, from basic reading and writing to advanced data manipulation. Python's csv module provides tools to process data in both list and dictionary formats, offering flexibility for different use cases.
You also explored handling various CSV formats through dialects and performing data validation and cleaning using pandas. These powerful tools allow you to efficiently manage and manipulate CSV data for a wide range of applications.
For more details, refer to the official Python documentation on the csv module:
Python CSV Documentation and Pandas Documentation.