# Python Pandas: Working with CSV Files

[Pandas](https://pandas.pydata.org/) is a powerful Python library for working with structured data. It's widely used for data analysis and makes handling CSV files easy with built-in tools for reading, writing, and processing.

With flexible import/export options, strong data cleaning tools, and customizable transformations, Pandas is ideal for all kinds of data tasks.

This guide will show you how to work with CSV files using Pandas and set it up for your data processing needs.

## Prerequisites

Before starting this guide, make sure you have a recent version of [Python](https://www.python.org/downloads/) and `pip` installed.

## Getting started with Pandas

To get the most out of this tutorial, it's best to work in a fresh Python environment where you can test the examples. Start by creating a new project folder and setting it up with these commands:

```command
mkdir pandas-csv-tutorial && cd pandas-csv-tutorial
```

```command
python3 -m venv venv
```

Now activate your virtual environment and install what you need:

```command
source venv/bin/activate
```

Next, install the latest version of [pandas](https://pypi.org/project/pandas/) using the command below:

```command
pip install pandas
```

Create a new `main.py` file in your project's root directory and add this content:

```python
[label main.py]
import pandas as pd

# Create a simple DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

print(df)
```

This code imports the pandas library using the standard `pd` alias and creates a basic DataFrame from a Python dictionary.

You'll learn different ways to customize DataFrame creation and CSV operations later in this guide, but for now, run your first program:

```command
python main.py
```

You should see this output:

```text
[output]
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
```

The output shows a clean table format that demonstrates how pandas organizes data in rows and columns. The DataFrame includes automatically created row indices (0, 1, 2) and clearly labeled columns that match your original dictionary keys.


## Reading CSV files with Pandas

The `pd.read_csv()` function is your main tool for importing CSV data into pandas DataFrames. This function has many parameters that allow you to handle various CSV formats and data quality issues commonly found in real-world datasets.

Start by creating a sample CSV file to work with. Create a file called `sample_data.csv` with this content:

```csv
[label sample_data.csv]
Name,Age,City,Salary
Alice Johnson,28,New York,75000
Bob Smith,32,Los Angeles,82000
Charlie Brown,29,Chicago,68000
Diana Prince,31,Houston,79000
Eve Adams,27,Phoenix,71000
```

Now read this CSV file using the simplest approach:

```python
[label main.py]
import pandas as pd

# Read CSV file
df = pd.read_csv('sample_data.csv')
print(df)
print(f"\nDataFrame shape: {df.shape}")
print(f"Column names: {list(df.columns)}")
```

The `read_csv()` function automatically detects the column headers from the first row and infers an appropriate data type for each column.

The `shape` attribute tells you the DataFrame has 5 rows and 4 columns, while the `columns` attribute gives you the column names.

When you run this code, you'll see this output:

```text
[output]
            Name  Age         City  Salary
0  Alice Johnson   28     New York   75000
1      Bob Smith   32  Los Angeles   82000
2  Charlie Brown   29      Chicago   68000
3   Diana Prince   31      Houston   79000
4      Eve Adams   27      Phoenix   71000

DataFrame shape: (5, 4)
Column names: ['Name', 'Age', 'City', 'Salary']
```

The output shows a neatly formatted table of 5 rows and 4 columns: **Name**, **Age**, **City**, and **Salary**.
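Beyond `shape` and `columns`, pandas provides a few other quick inspection methods that are handy right after an import. A minimal sketch, reusing the small DataFrame from earlier:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# head() previews the first rows without printing the whole table
print(df.head(2))

# info() summarizes dtypes and non-null counts per column
df.info()

# describe() computes basic statistics for numeric columns
print(df.describe())
```

`head()` is especially useful with large files, where printing the entire DataFrame would flood your terminal.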



### Handling different CSV formats

Real-world CSV files often don't follow standard formatting. Pandas gives you many options to handle these variations. Look at this example with a differently formatted CSV:

```csv
[label messy_data.csv]
# Employee Data Export
# Generated on 2025-06-24
Name;Age;City;Salary;Department
Alice Johnson;28;New York;75000;Engineering
Bob Smith;32;Los Angeles;82000;Marketing
Charlie Brown;29;Chicago;68000;Sales
```

To read this file correctly, you need to specify extra parameters:

```python
[label main.py]
import pandas as pd

# Read CSV with custom parameters
df = pd.read_csv(
    'messy_data.csv',
    sep=';',           # Use semicolon as separator
    skiprows=2,        # Skip the first 2 comment lines
    encoding='utf-8'   # Specify encoding
)

print(df)
```

In this code, you're using `pandas.read_csv()` with extra options to handle a non-standard CSV file. The file uses semicolons instead of commas, and has two comment lines at the top.

* `sep=';'` tells Pandas to use semicolons as column separators.
* `skiprows=2` skips the first two lines (comments).
* `encoding='utf-8'` ensures proper character decoding.

The output is a clean DataFrame with correctly parsed columns:

```text
[output]
            Name  Age         City  Salary   Department
0  Alice Johnson   28     New York   75000  Engineering
1      Bob Smith   32  Los Angeles   82000    Marketing
2  Charlie Brown   29      Chicago   68000        Sales
```

Even with unusual formatting, Pandas lets you load the data cleanly and accurately.
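As an alternative to counting header lines with `skiprows`, you can let pandas skip them for you with the `comment` parameter, which ignores any line that begins with the given character. A short sketch, using `io.StringIO` to stand in for `messy_data.csv`:

```python
import io
import pandas as pd

# io.StringIO stands in for messy_data.csv here
csv_text = """# Employee Data Export
# Generated on 2025-06-24
Name;Age;City;Salary;Department
Alice Johnson;28;New York;75000;Engineering
Bob Smith;32;Los Angeles;82000;Marketing
Charlie Brown;29;Chicago;68000;Sales
"""

# comment='#' drops every line starting with '#', so you don't
# need to know in advance how many comment lines the file has
df = pd.read_csv(io.StringIO(csv_text), sep=';', comment='#')
print(df.shape)
```

This approach is more robust when the number of comment lines varies between exports.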

### Controlling data types during import

Pandas automatically guesses data types, but sometimes you need to control exactly how columns get interpreted. This is important for keeping your data accurate and using memory efficiently:

```python
[label main.py]
import pandas as pd

# Specify data types explicitly
dtype_dict = {
    'Name': 'string',
    'Age': 'int32',
    'Salary': 'float64'
}

df = pd.read_csv('sample_data.csv', dtype=dtype_dict)
print(df.dtypes)
print(f"\nMemory usage:\n{df.memory_usage(deep=True)}")
```

```text
[output]
Name      string[python]
Age                int32
City              object
Salary           float64
dtype: object

Memory usage:
Index     132
Name      301
Age        20
City      285
Salary     40
dtype: int64
```

When you specify the right data types, you can significantly reduce memory usage, especially when working with large datasets. The `string` dtype uses less memory than the default `object` type for text data.
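Two other `read_csv()` parameters help with large files: `usecols` loads only the columns you name, and `chunksize` streams the file in pieces instead of reading it all into memory at once. A minimal sketch, with `io.StringIO` standing in for `sample_data.csv`:

```python
import io
import pandas as pd

# io.StringIO stands in for sample_data.csv here
csv_text = """Name,Age,City,Salary
Alice Johnson,28,New York,75000
Bob Smith,32,Los Angeles,82000
Charlie Brown,29,Chicago,68000
"""

# usecols loads only the columns you need and skips the rest entirely
df = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Salary'])
print(df.columns.tolist())

# chunksize streams the file in pieces instead of loading it all at once
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk['Salary'].sum()
print(total)
```

For multi-gigabyte CSVs, combining both techniques can make the difference between a job that fits in memory and one that doesn't.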


## Processing and cleaning CSV data

Once you've imported your CSV data, the next step is cleaning and preprocessing to ensure good data quality. Real-world datasets often have inconsistencies, missing values, and formatting issues that need fixing before analysis.

### Handling missing values

Missing data is one of the most common problems in data processing. Pandas gives you several ways to find and deal with missing values.

First, let's create a sample dataset with missing values to see how Pandas handles them:

```python
[label main.py]
import pandas as pd
import numpy as np

# Create sample data with missing values
data = {
    'Name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'Age': [28, 32, 29, None, 27],
    'City': ['New York', None, 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [75000, 82000, np.nan, 79000, 71000]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print(f"\nMissing values per column:\n{df.isnull().sum()}")
```

This code creates a DataFrame with intentional missing values (`None` and `np.nan`) across different columns. The `isnull().sum()` method counts missing values in each column.

```text
[output]
Original DataFrame:
    Name   Age      City   Salary
0  Alice  28.0  New York  75000.0
1    Bob  32.0      None  82000.0
2   None  29.0   Chicago      NaN
3  Diana   NaN   Houston  79000.0
4    Eve  27.0   Phoenix  71000.0

Missing values per column:
Name      1
Age       1
City      1
Salary    1
dtype: int64
```

You can deal with missing values in different ways depending on your data and analysis needs. Here are two common strategies:

```python
[label main.py]
...
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print(f"\nMissing values per column:\n{df.isnull().sum()}")

[highlight]
# Strategy 1: Drop rows with any missing values
df_dropped = df.dropna()
print("After dropping rows with missing values:")
print(df_dropped)

# Strategy 2: Fill missing values with defaults
df_filled = df.fillna({
    'Name': 'Unknown',
    'Age': df['Age'].median(),
    'City': 'Not Specified',
    'Salary': df['Salary'].mean()
})
print("\nAfter filling missing values:")
print(df_filled)
[/highlight]
```

The first strategy removes any rows that have missing values, while the second fills missing values with sensible defaults like medians, means, or placeholder text.

```text
[output]
Original DataFrame:
...
Missing values per column:
Name      1
Age       1
City      1
Salary    1
dtype: int64
After dropping rows with missing values:
    Name   Age      City   Salary
0  Alice  28.0  New York  75000.0
4    Eve  27.0   Phoenix  71000.0

After filling missing values:
      Name   Age           City   Salary
0    Alice  28.0       New York  75000.0
1      Bob  32.0  Not Specified  82000.0
2  Unknown  29.0        Chicago  76750.0
3    Diana  28.5        Houston  79000.0
4      Eve  27.0        Phoenix  71000.0
```

Which approach to choose depends on your data: `dropna()` is safest when incomplete rows are rare and expendable, while `fillna()` preserves every row by substituting sensible defaults such as the median age, mean salary, or placeholder text.
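`dropna()` also accepts finer-grained options: `subset` drops a row only when specific columns are missing, and `thresh` keeps rows that have at least a given number of non-missing values. A small sketch with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [28.0, np.nan, 29.0],
    'Salary': [75000.0, 82000.0, np.nan],
})

# Drop a row only when the Salary column is missing
by_salary = df.dropna(subset=['Salary'])

# Keep rows that have at least 2 non-missing values
mostly_complete = df.dropna(thresh=2)

print(by_salary)
print(mostly_complete)
```

These options let you be strict about the columns that matter for your analysis while tolerating gaps elsewhere.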



### Data transformation and normalization

Data transformation often means standardizing formats, converting data types, and creating new columns for analysis. This helps you derive more insights from your existing data.

Let's add some useful calculated columns to our employee dataset:

```python
[label main.py]
import pandas as pd

# Read our sample data
df = pd.read_csv('sample_data.csv')

# Transform and enhance the data
df['Name_Length'] = df['Name'].str.len()
df['Salary_Category'] = pd.cut(df['Salary'], 
                              bins=[0, 70000, 80000, float('inf')], 
                              labels=['Low', 'Medium', 'High'])
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Experienced')

print("Transformed DataFrame:")
print(df[['Name', 'Age', 'Salary', 'Age_Group', 'Salary_Category']])
```

This code creates three new columns: `Name_Length` counts characters in names, `Salary_Category` groups salaries into ranges, and `Age_Group` categorizes employees as young or experienced based on age.

```text
[output]
Transformed DataFrame:
            Name  Age  Salary    Age_Group Salary_Category
0  Alice Johnson   28   75000        Young          Medium
1      Bob Smith   32   82000  Experienced            High
2  Charlie Brown   29   68000        Young             Low
3   Diana Prince   31   79000  Experienced          Medium
4      Eve Adams   27   71000        Young          Medium
```

The data cleaning and transformation techniques you've learned in this section form the foundation of effective data analysis. By handling missing values appropriately and creating meaningful derived columns, you can ensure your data is ready for deeper analysis. Whether you choose to remove incomplete records or fill them with calculated values depends on your specific use case and the importance of preserving all available data.

These preprocessing steps might seem basic, but they're crucial for reliable results. Clean, well-structured data leads to more accurate insights and prevents errors in downstream analysis.
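The last piece of the workflow is writing your cleaned data back out, which `to_csv()` handles in a single call. A minimal sketch (the filename is just an example):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [75000, 82000]})

# index=False omits the row-index column from the file;
# 'cleaned_data.csv' is an example filename
df.to_csv('cleaned_data.csv', index=False)

# Round-trip check: read the file back in
round_trip = pd.read_csv('cleaned_data.csv')
print(round_trip)
```

Without `index=False`, pandas writes the row index as an extra unnamed column, which then shows up as `Unnamed: 0` when the file is read back.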


## Final thoughts
Pandas makes working with CSV files fast, flexible, and intuitive—whether you're importing clean data or wrangling messy, real-world files. From reading and cleaning data to transforming and analyzing it, Pandas gives you all the tools you need to build reliable data workflows in Python.

Now that you’ve learned how to handle different CSV formats, manage missing data, and apply basic transformations, you’re well equipped to take on more complex data tasks.

For more advanced features and deeper exploration, check out the [official Pandas documentation](https://pandas.pydata.org/).