
Python Pandas: Working with CSV Files

Stanley Ulili
Updated on June 24, 2025

Pandas is a powerful Python library for working with structured data. It's widely used for data analysis and makes handling CSV files easy with built-in tools for reading, writing, and processing.

With flexible import/export options, strong data cleaning tools, and customizable transformations, Pandas is ideal for all kinds of data tasks.

This guide shows you how to set up Pandas and use it to read, clean, and transform CSV files for your data processing needs.

Prerequisites

Before starting this guide, make sure you have a recent version of Python and pip installed.
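
You can confirm both from your terminal:

python3 --version
pip --version

Any recent Python 3 release will work for the examples in this guide.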

Getting started with Pandas

To get the most out of this tutorial, it's best to work in a fresh Python environment where you can test the examples. Start by creating a new project folder and setting it up with these commands:

mkdir pandas-csv-tutorial && cd pandas-csv-tutorial
python3 -m venv venv

Now activate your virtual environment and install what you need:

source venv/bin/activate

Next, install the latest version of pandas using the command below:

pip install pandas

Create a new main.py file in your project's root directory and add this content:

main.py
import pandas as pd

# Create a simple DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

print(df)

This code imports the pandas library using the standard pd alias and creates a basic DataFrame from a Python dictionary.

You'll learn different ways to customize DataFrame creation and CSV operations later in this guide, but for now, run your first program:

python main.py

You should see this output:

Output
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

The output shows a clean table format that demonstrates how pandas organizes data in rows and columns. The DataFrame includes automatically created row indices (0, 1, 2) and clearly labeled columns that match your original dictionary keys.
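
The automatic 0-based index is just the default. If you want more meaningful row labels, you can pass your own when constructing the DataFrame; here's a minimal sketch (the labels are arbitrary):

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}

# Supply custom row labels instead of the default 0, 1, 2
df = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df)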

Reading CSV files with Pandas

The pd.read_csv() function is your main tool for importing CSV data into pandas DataFrames. This function has many parameters that allow you to handle various CSV formats and data quality issues commonly found in real-world datasets.

Start by creating a sample CSV file to work with. Create a file called sample_data.csv with this content:

sample_data.csv
Name,Age,City,Salary
Alice Johnson,28,New York,75000
Bob Smith,32,Los Angeles,82000
Charlie Brown,29,Chicago,68000
Diana Prince,31,Houston,79000
Eve Adams,27,Phoenix,71000

Now read this CSV file using the simplest approach:

main.py
import pandas as pd

# Read CSV file
df = pd.read_csv('sample_data.csv')
print(df)
print(f"\nDataFrame shape: {df.shape}")
print(f"Column names: {list(df.columns)}")

The read_csv() function automatically detected the column headers from the first row and inferred an appropriate data type for each column.

The shape attribute tells you that your DataFrame has 5 rows and 4 columns, while the columns attribute gives you the column names.

When you run this code, you'll see this output:

Output
            Name  Age         City  Salary
0  Alice Johnson   28     New York   75000
1      Bob Smith   32  Los Angeles   82000
2  Charlie Brown   29      Chicago   68000
3   Diana Prince   31      Houston   79000
4      Eve Adams   27      Phoenix   71000

DataFrame shape: (5, 4)
Column names: ['Name', 'Age', 'City', 'Salary']

The output confirms what the attributes report: a neatly formatted table of five rows and four columns named Name, Age, City, and Salary.
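
read_csv() also has parameters for loading just part of a file, which helps with very large datasets. For example, usecols selects specific columns and nrows caps how many data rows get read; a quick sketch against the same file:

import pandas as pd

# Load only two columns and the first three data rows
df_partial = pd.read_csv('sample_data.csv', usecols=['Name', 'Salary'], nrows=3)
print(df_partial)

Both options take effect while the file is being parsed, so the excluded rows and columns never end up in the DataFrame.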


Handling different CSV formats

Real-world CSV files often don't follow standard formatting. Pandas gives you many options to handle these variations. Look at this example with a differently formatted CSV:

messy_data.csv
# Employee Data Export
# Generated on 2025-06-24
Name;Age;City;Salary;Department
Alice Johnson;28;New York;75000;Engineering
Bob Smith;32;Los Angeles;82000;Marketing
Charlie Brown;29;Chicago;68000;Sales

To read this file correctly, you need to specify extra parameters:

main.py
import pandas as pd

# Read CSV with custom parameters
df = pd.read_csv(
    'messy_data.csv',
    sep=';',           # Use semicolon as separator
    skiprows=2,        # Skip the first 2 comment lines
    encoding='utf-8'   # Specify encoding
)

print(df)

In this code, you're using pandas.read_csv() with extra options to handle a non-standard CSV file. The file uses semicolons instead of commas, and has two comment lines at the top.

  • sep=';' tells Pandas to use semicolons as column separators.
  • skiprows=2 skips the first two lines (comments).
  • encoding='utf-8' ensures proper character decoding.

The output is a clean DataFrame with correctly parsed columns:

Output
            Name  Age         City  Salary   Department
0  Alice Johnson   28     New York   75000  Engineering
1      Bob Smith   32  Los Angeles   82000    Marketing
2  Charlie Brown   29      Chicago   68000        Sales

Even with unusual formatting, Pandas lets you load the data cleanly and accurately.
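
When every junk line starts with the same character, as the # comments do here, the comment parameter is an alternative to counting lines with skiprows: pandas will ignore any line that begins with that character. A minimal sketch of the same import:

import pandas as pd

# comment='#' makes pandas skip any line starting with '#'
df = pd.read_csv('messy_data.csv', sep=';', comment='#')
print(df)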

Controlling data types during import

Pandas automatically guesses data types, but sometimes you need to control exactly how columns get interpreted. This is important for keeping your data accurate and using memory efficiently:

main.py
import pandas as pd

# Specify data types explicitly
dtype_dict = {
    'Name': 'string',
    'Age': 'int32',
    'Salary': 'float64'
}

df = pd.read_csv('sample_data.csv', dtype=dtype_dict)
print(df.dtypes)
print(f"\nMemory usage:\n{df.memory_usage(deep=True)}")

Output
Name      string[python]
Age                int32
City              object
Salary           float64
dtype: object

Memory usage:
Index     132
Name      301
Age        20
City      285
Salary     40
dtype: int64

When you specify the right data types, you can significantly reduce memory usage, especially when working with large datasets. For example, int32 stores the Age column in half the space of the default int64, and the dedicated string dtype tells pandas the column contains only text, giving you stricter type handling than the generic object type.
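
For text columns with many repeated values, the category dtype can shrink memory further, because pandas stores each unique value only once. The sample file is too small and varied to show dramatic savings, but the pattern looks like this:

import pandas as pd

# Store repeated text as categories; pandas keeps one copy per unique value
df = pd.read_csv('sample_data.csv', dtype={'City': 'category'})
print(df['City'].dtype)
print(df.memory_usage(deep=True))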

Processing and cleaning CSV data

Once you've imported your CSV data, the next step is cleaning and preprocessing to ensure good data quality. Real-world datasets often have inconsistencies, missing values, and formatting issues that need fixing before analysis.

Handling missing values

Missing data is one of the most common problems in data processing. Pandas gives you several ways to find and deal with missing values.

First, let's create a sample dataset with missing values to see how Pandas handles them:

main.py
import pandas as pd
import numpy as np

# Create sample data with missing values
data = {
    'Name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'Age': [28, 32, 29, None, 27],
    'City': ['New York', None, 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [75000, 82000, np.nan, 79000, 71000]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print(f"\nMissing values per column:\n{df.isnull().sum()}")

This code creates a DataFrame with intentional missing values (None and np.nan) across different columns. The isnull().sum() method counts missing values in each column.

Output
Original DataFrame:
    Name   Age      City   Salary
0  Alice  28.0  New York  75000.0
1    Bob  32.0      None  82000.0
2   None  29.0   Chicago      NaN
3  Diana   NaN   Houston  79000.0
4    Eve  27.0   Phoenix  71000.0

Missing values per column:
Name      1
Age       1
City      1
Salary    1
dtype: int64

You can deal with missing values in different ways depending on your data and analysis needs. Here are two common strategies:

main.py
...
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print(f"\nMissing values per column:\n{df.isnull().sum()}")

# Strategy 1: Drop rows with any missing values
df_dropped = df.dropna()
print("After dropping rows with missing values:")
print(df_dropped)

# Strategy 2: Fill missing values with defaults
df_filled = df.fillna({
    'Name': 'Unknown',
    'Age': df['Age'].median(),
    'City': 'Not Specified',
    'Salary': df['Salary'].mean()
})
print("\nAfter filling missing values:")
print(df_filled)

The first strategy removes any row that contains a missing value, while the second fills the gaps with sensible defaults such as the median age, the mean salary, or placeholder text. Either way you end up with a cleaner DataFrame; which approach fits depends on whether you can afford to discard incomplete rows.

Output
Original DataFrame:
...
Missing values per column:
Name      1
Age       1
City      1
Salary    1
dtype: int64
After dropping rows with missing values:
    Name   Age      City   Salary
0  Alice  28.0  New York  75000.0
4    Eve  27.0   Phoenix  71000.0

After filling missing values:
      Name   Age           City   Salary
0    Alice  28.0       New York  75000.0
1      Bob  32.0  Not Specified  82000.0
2  Unknown  29.0        Chicago  76750.0
3    Diana  28.5        Houston  79000.0
4      Eve  27.0        Phoenix  71000.0
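
You can also be more surgical than an all-or-nothing dropna(). Its subset parameter removes only rows that are missing specific critical columns; here's a minimal sketch using the same kind of data:

import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'Age': [28, 32, 29, None, 27],
    'Salary': [75000, 82000, np.nan, 79000, 71000]
}
df = pd.DataFrame(data)

# Drop only rows where 'Name' is missing; gaps in other columns are kept
print(df.dropna(subset=['Name']))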


Data transformation and normalization

Data transformation often means standardizing formats, converting data types, and creating new columns for analysis. This helps you derive more insights from your existing data.

Let's add some useful calculated columns to our employee dataset:

main.py
import pandas as pd

# Read our sample data
df = pd.read_csv('sample_data.csv')

# Transform and enhance the data
df['Name_Length'] = df['Name'].str.len()
df['Salary_Category'] = pd.cut(df['Salary'], 
                              bins=[0, 70000, 80000, float('inf')], 
                              labels=['Low', 'Medium', 'High'])
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Experienced')

print("Transformed DataFrame:")
print(df[['Name', 'Age', 'Salary', 'Age_Group', 'Salary_Category']])

This code creates three new columns: Name_Length counts characters in names, Salary_Category groups salaries into ranges, and Age_Group categorizes employees as young or experienced based on age.

Output
Transformed DataFrame:
            Name  Age  Salary    Age_Group Salary_Category
0  Alice Johnson   28   75000        Young          Medium
1      Bob Smith   32   82000  Experienced            High
2  Charlie Brown   29   68000        Young             Low
3   Diana Prince   31   79000  Experienced          Medium
4      Eve Adams   27   71000        Young          Medium
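
Derived columns like these also set you up for quick aggregations. For example, once Age_Group exists, a groupby() makes it easy to compare average salaries across groups; a minimal sketch:

import pandas as pd

df = pd.read_csv('sample_data.csv')
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Experienced')

# Average salary for each derived age group
print(df.groupby('Age_Group')['Salary'].mean())

With the sample data, this reports an average of 80500.0 for the Experienced group and about 71333.33 for the Young group.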

The data cleaning and transformation techniques you've learned in this section form the foundation of effective data analysis. By handling missing values appropriately and creating meaningful derived columns, you can ensure your data is ready for deeper analysis. Whether you choose to remove incomplete records or fill them with calculated values depends on your specific use case and the importance of preserving all available data.

These preprocessing steps might seem basic, but they're crucial for reliable results. Clean, well-structured data leads to more accurate insights and prevents errors in downstream analysis.
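
Once your data is clean and enriched, you'll often want to write it back out. to_csv() is the writing counterpart of read_csv(); here's a minimal sketch (the output filename is arbitrary):

import pandas as pd

df = pd.read_csv('sample_data.csv')
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Experienced')

# index=False keeps the automatic row labels out of the file
df.to_csv('transformed_employees.csv', index=False)

Passing index=False is usually what you want when the index carries no meaning of its own.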

Final thoughts

Pandas makes working with CSV files fast, flexible, and intuitive—whether you're importing clean data or wrangling messy, real-world files. From reading and cleaning data to transforming and analyzing it, Pandas gives you all the tools you need to build reliable data workflows in Python.

Now that you’ve learned how to handle different CSV formats, manage missing data, and apply basic transformations, you’re well equipped to take on more complex data tasks.

For more advanced features and deeper exploration, check out the official Pandas documentation.
