The most common form of data I work with is found in CSV format. This makes the data easy to understand but tedious to modify or change, especially when the CSV file contains lots of columns.
In this article I will demonstrate how I frequently import this file format into a usable data structure in Python, namely, a list of Python dictionaries.
Let’s start with defining the basic terminology so we’re on the same page before jumping into some examples.
What Is A CSV File?
CSV (Comma-Separated Values) files are a common data format used in many applications for storing tabular data. Simple and easy to read, CSV files are often used for exporting data from databases, spreadsheets, and other data management tools.
Despite their simplicity, handling CSV data efficiently in programming can sometimes be challenging – especially when there are lots of fields (columns). This is where Python, with its powerful libraries and simple syntax, becomes a valuable tool.
What Are Dictionaries And Lists?
In Python, dictionaries are versatile data structures that allow storage of data in a key-value pair format. This format is particularly useful for data manipulation and retrieval, as it enables fast access to data elements and provides an intuitive way to work with complex data structures.
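As a minimal sketch, a dictionary holding one (hypothetical) employee record might look like this:

```python
# A dictionary stores data as key-value pairs
employee = {"Name": "John", "Age": 28, "Dept": "Finance"}

# Values are retrieved by key rather than by position
print(employee["Dept"])  # Finance
```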
A list in Python is another data structure for storing data, one that holds a sequence of separate elements. A list in Python can store a combination of different data types (or all the same data type), such as numbers, strings, objects, or even other lists!
The concept of a Python list can be likened to a spreadsheet column with rows of data. In a spreadsheet, a column is a vertical arrangement of cells, where each cell can hold different types of data such as numbers, text, or dates. Similarly, in Python, a list is an ordered collection of items, where each item can be of a different data type, such as an integer, a string, or even another list.
| Name | Age | Dept    |
|------|-----|---------|
| John | 28  | Finance |
| Jane | 32  | IT      |
Just as you can add more rows underneath each column, you can add more items to a Python list.
Each row in the spreadsheet is identified by its row number, meaning all information contained in the same table on the same row has some relation. In the above example you could surmise that John is aged 28 and works in the Finance department.
This same information, which is displayed quite beautifully in a spreadsheet, would be contained within a Python list, where the contents of each cell are identified by their position. For example, if you were to construct the same data of John, aged 28, working in the Finance department into a Python list, it might look something like this in its simplest sense:
```python
["John", 28, "Finance"]
```
If you got John's age wrong and needed to edit the contents of this list, you would need to know the index of the element representing the field to change. This is where converting a CSV file into a list of dictionaries has benefits over converting it into a list of lists: you only need to know the column name, rather than the index position of the column.
Finally, just as a spreadsheet allows for operations like sorting or filtering on a column, Python provides various methods to manipulate lists, such as sorting items, reversing their order, or searching for specific elements.
Therefore, if you imagine a spreadsheet column where each cell can hold any type of data, and you can dynamically add, remove, or change the contents of these cells, you have a good conceptual analogy for a Python list.
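The earlier point about editing John's age can be sketched like this (the employee data is hypothetical):

```python
# Editing by position: you must remember that index 1 holds the age
employee_list = ["John", 28, "Finance"]
employee_list[1] = 29

# Editing by name: the key tells you exactly what you are changing
employee_dict = {"Name": "John", "Age": 28, "Dept": "Finance"}
employee_dict["Age"] = 29
```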
What Is A List Of Dictionaries?
Let's combine both concepts of a list and a dictionary in Python, continuing with our example of a spreadsheet where each row represents a record, and each column header represents a specific attribute or field of that record. In this analogy:
- Python List: This represents the entirety of the spreadsheet. Each element in the list corresponds to a row in the spreadsheet.
- Python Dictionary: Each dictionary within the list represents a single row in the spreadsheet. The keys of the dictionary are akin to the column headers in the spreadsheet, indicating the attribute or field name. The values of the dictionary are like the cells in a row, containing the data for each attribute.
For example, consider a spreadsheet that tracks employee information with columns for "Name", "Age", and "Department". Each row in the spreadsheet contains data for one employee.
In Python, this spreadsheet could be represented as a list of dictionaries. Each dictionary in the list corresponds to one row (one employee), with the keys being "Name", "Age", and "Department", and the values being the specific details for each employee.
Therefore, a list of dictionaries in Python is a structured way to represent data that you might typically store in a tabular format like a spreadsheet, where each dictionary within the list holds related data points (akin to a record) organised by keys (similar to column headers).
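As a concrete sketch, the two-employee spreadsheet above could be written as the following list of dictionaries:

```python
employees = [
    {"Name": "John", "Age": 28, "Dept": "Finance"},
    {"Name": "Jane", "Age": 32, "Dept": "IT"},
]

# Each dictionary is one "row"; each key is a "column header"
print(employees[0]["Dept"])  # Finance
print(employees[1]["Name"])  # Jane
```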
Why Converting CSV to Dictionaries is Useful
As you can see, a list of dictionaries can be easily understood when working with a spreadsheet, which is why it makes perfect sense to create this data structure when working with CSV files.
Here are some benefits if you’re still unsure of whether you want to import a CSV file into a list of dictionaries in Python:
- Enhanced Data Accessibility: In a dictionary, each row from the CSV can be accessed as a separate entity with clear key-value pairs, making it easier to read and manipulate specific data fields.
- Improved Readability: Data stored in dictionaries tends to be more readable, especially when dealing with large datasets. The key-value pairing provides a clear understanding of what each data point represents.
- Ease of Manipulation: With dictionaries, data can be easily modified, added, or removed. This flexibility is especially useful when performing data transformations or cleaning tasks.
- Compatibility with Python Ecosystem: Python dictionaries are highly compatible with various Python libraries and frameworks, making them ideal for further data processing, analysis, and visualization tasks.
- Error Handling: When data is loaded into dictionaries, it’s easier to handle inconsistencies and errors in CSV files, such as missing values or varied data formats.
- Dynamic Structure: Unlike arrays or lists, dictionaries do not require a uniform structure across all rows. This allows for handling CSV files with varying columns or complex nested data.
Best Method: Use `csv.DictReader`
Let’s jump into some Python code and explore how to import CSV data into a list of dictionaries.
The easiest way to get a CSV file into a list of dictionaries is by using Python's built-in `csv` library.
Within the `csv` module, a class called `DictReader` makes it straightforward to read a CSV file into a list of dictionaries.
Each row in the CSV file is converted into a dictionary, with the keys being the header row (by default) and the values being the corresponding data from each row.
Basic Example
Consider a simple CSV file named `emp.csv` with the following content:
```
Name,Age,Dept
John,28,Finance
Jane,32,IT
```
This file can be read into a list of dictionaries using the following Python code:
```python
import csv

# Define the path to the CSV file
csv_path = 'emp.csv'

# Initialize an empty list to hold the dictionaries
result = []

# Open the CSV file
with open(csv_path, 'r') as f:
    # Create a csv.DictReader object
    csv_reader = csv.DictReader(f)
    result = [*csv_reader]

# Show result
print(result)
```
Running the above code prints the following result (note that every value is imported as a string):

```
[{'Name': 'John', 'Age': '28', 'Dept': 'Finance'}, {'Name': 'Jane', 'Age': '32', 'Dept': 'IT'}]
```
Explanation of the Code
Here’s a detailed description of the above code:
import csv
This line imports Python’s built-in CSV module, which provides the necessary functions and classes for handling CSV files.
csv_path = 'emp.csv'
This is where you provide the path to the CSV file. You might need to construct a full path using `os.path.join(...)` depending on where the CSV file is located relative to the Python script.
result = []
This will be the list you initialise to hold the dictionaries for each row in the CSV file.
with open(csv_path, 'r') as f:
This opens the file in read-only mode (`'r'`), and by opening the file within a `with` statement it ensures the file is properly closed once the code inside finishes.
csv_reader = csv.DictReader(f)
The file object `f` is then passed as the primary parameter to the `DictReader` class. The magic of converting the CSV file into dictionaries happens within this class: it automatically uses the first row of the CSV file (the headers) as the keys for each dictionary.
result = [*csv_reader]
To get each of the dictionaries into a list, I use the asterisk operator to unpack the reader so that each dictionary becomes an element of the list.
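If you're unfamiliar with this unpacking trick, here's a tiny sketch showing that `[*iterable]` is equivalent to `list(iterable)`:

```python
# [*iterable] unpacks any iterable into a new list,
# equivalent to calling list(iterable)
rows = iter([{"Name": "John"}, {"Name": "Jane"}])
result = [*rows]
print(result)  # [{'Name': 'John'}, {'Name': 'Jane'}]
```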
What If I Wanted To Perform A Check On Each Dictionary?
The above code represents the quickest way to get a list of dictionary items.
If, however, you want to remove, add or check each dictionary, you could modify the code as follows to iterate through each dictionary:
```python
import csv

csv_path = 'emp.csv'
result = []

with open(csv_path, 'r') as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        row['Org'] = "ABC"
        result.append(row)
```
I've found this to be quite a useful technique, especially when you're converting imported date strings to a `datetime.date` data type.
As you can see, the only real difference between this code and the former is that instead of wrapping everything into a list in one line of code, you now have more control over what is imported into your list.
Therefore, the new lines are as follows:
for row in csv_reader:
This loop iterates over each row in the CSV file.
row['Org'] = "ABC"
Do whatever operation you would like to perform inside this `for` loop. If you want to add an additional property to the dictionary, such as the name of the organisation, `Org`, you can do that here.
Modifying Date Fields
My most frequent modification of the values in the dictionary is converting a date field into a `date` data type.
The `csv.DictReader` class will import the data in these fields as strings, exactly as represented in the CSV file. Therefore, if your script needs to operate on the data contained in these fields as dates, you will need to find a way to import them into your list of dictionaries as `date` values.
Suppose the original CSV data file didn't have the Age of the employees, but instead had the Date of Birth as a DD/MM/YYYY field, like so:
```
Name,Date of Birth,Dept
John,28/10/1977,Finance
Jane,03/02/1975,IT
```
If you wanted to operate on the Date of Birth field as a date in your Python script, you would need to change the for loop section to something like this:
```python
import csv
from datetime import datetime

csv_path = 'emp.csv'
result = []

with open(csv_path, 'r') as f:
    csv_reader = csv.DictReader(f)
    for row in csv_reader:
        # strptime returns a datetime; .date() keeps just the date part
        row['DOB'] = datetime.strptime(row['Date of Birth'], '%d/%m/%Y').date()
        result.append(row)
```
As the previous example showed, you can operate on each dictionary before inserting it into your list, so you can perform any necessary conversions of the data into the data types needed by your Python code.
How Does `DictReader` Handle Missing Data?
Another common problem you will come across is how to handle missing data in your CSV file: just what does `csv.DictReader` do with blank or empty fields?
If we use the following CSV data, let’s see the result when passing it through our first coding example (the basic example) because Jane didn’t want to disclose her age:
```
Name,Age,Dept
John,28,Finance
Jane,,IT
```
Passing this through produces the following result:

```
[{'Name': 'John', 'Age': '28', 'Dept': 'Finance'}, {'Name': 'Jane', 'Age': '', 'Dept': 'IT'}]
```

As you can see, blank values are handled as empty strings.
If it is critical within your code that these values are either numbers or `None`, then you will want to perform your operations on each row in the for loop.
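As a sketch of that idea, here's one way to convert the Age field to an integer, or `None` when it's blank (using `io.StringIO` to stand in for the open file):

```python
import csv
import io

# Sample CSV data with a blank Age field for Jane
data = io.StringIO("Name,Age,Dept\nJohn,28,Finance\nJane,,IT\n")

result = []
for row in csv.DictReader(data):
    # Convert Age to an int, or None when the field is blank
    row["Age"] = int(row["Age"]) if row["Age"] else None
    result.append(row)

print(result)
```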
Use A Different Header Row Than First Row
What if you don’t want to use the first row in the CSV file for the keys of each dictionary?
I recently had a CSV file that contained some header information, and I needed to ignore the first few lines of the file.
A simplified example of what I was working on looked like this with a nonsensical header and a blank row:
```
Some useless info with blank rows before the actual header row


Name,Age,Dept
John,28,Finance
Jane,,IT
```
To convert this CSV file into a list of dictionaries, you'll need to exclude these leading lines from your CSV data.
One method I chose was to use the `readlines()` method (as the file wasn't terribly large) together with the slice operator.
Here’s how this code looked:
```python
import csv

csv_path = 'emp.csv'
result = []

with open(csv_path, 'r') as f:
    lines = f.readlines()[3:]
    reader = csv.DictReader(lines)
    result = [*reader]
```
Here’s what the lines in the code mean:
lines = f.readlines()[3:]
I used the file method `readlines()` to read all the lines of the CSV file into memory, and from this result I used the slice operator to exclude the first 3 lines, as these lines in my working example contained either blank lines or a nonsensical line that wasn't needed for processing into a dictionary.
Note that this line assumes you know how many lines to exclude. I'll demonstrate how you can get around this limitation below.
reader = csv.DictReader(lines)
The only other change needed was to use the newly created variable `lines` instead of the file handle `f`.
What if you don’t know the number of header rows in your CSV file to skip?
If you don’t know the number of lines to exclude, then you can use a regular expression to identify how many lines need to be excluded. This would mean having some form of identifiable information in the header that is unique to that row.
Continuing with our working example, as the header row starts with the string “Name” I could use that as a way to identify the header row using a regular expression.
My code would be a little more robust using this technique and would look something like this:
```python
import csv
import re

csv_path = 'emp.csv'
result = []

with open(csv_path, 'r') as f:
    lines = f.readlines()
    start = [idx for idx, x in enumerate(lines) if re.match('^Name', x)]
    reader = csv.DictReader(lines[start[0]:])
    result = [*reader]
```
Here’s what the additional lines of code mean:
import re
To use regular expressions, you need to include the regular expression library by importing it.
start = [idx for idx, x in enumerate(lines) if re.match('^Name', x)]
This is a list comprehension that uses the `enumerate()` function to keep an index counter while iterating through the `lines` variable. The `re.match()` function checks whether a line starts with `Name`; if the line doesn't start with that string, `re.match()` returns `None`.
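A quick sketch of this `re.match()` behaviour:

```python
import re

# re.match only matches at the start of the string
print(re.match('^Name', 'Name,Age,Dept'))    # a match object
print(re.match('^Name', 'John,28,Finance'))  # None
```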
reader = csv.DictReader(lines[start[0]:])
As the `start` variable is a list (with hopefully only one item in it), I use `start[0]`, which should be the index of the row where the regular expression matched, and I use this in the slice operator on the `lines` variable to remove all the useless rows in my CSV file.
How Do I Exclude Rows?
Using the same principle as removing superfluous header rows, you can also remove any rows not needed in your list of dictionaries.
If the unwanted rows sit above and below the data in fixed positions, you can use the slice operator on the `lines` variable to extract only what you need.
You also don't need to worry about filtering out blank rows, unless those blank rows are found before your header row.
Alternatively, you could iterate through the `lines` variable, apply a regular expression to each line, and keep or discard that line according to your conditions.
For example, one common task is to exclude blank rows within the data, which `DictReader` handles nicely by itself, so you don't need to worry about excluding empty lines.
But what if you were handling dirty data?
You would need to find a way to identify the dirty rows and the clean rows.
For example, suppose I had the following data set:
```
ABC
Name,Age,Dept
123
John,28,Finance
456
789
Jane,,IT
000
```
I would be able to purge the unnecessary rows by simply identifying if there are two commas in each row. Your use case will be different.
To apply a filter to my CSV file `lines`, I could write something like this:
```python
import csv

csv_path = 'emp.csv'
result = []

with open(csv_path, 'r') as f:
    lines = f.readlines()
    clean = [x for x in lines if len(x) - len(x.replace(",", "")) == 2]
    reader = csv.DictReader(clean)
    result = [*reader]
```
The section in the above code that applies the filter is this line:
clean = [x for x in lines if len(x) - len(x.replace(",", "")) == 2]
This code simply replaces all commas with an empty string and then calculates how many commas were removed; if there were 2, it is a line I need for my list of dictionaries.
Obviously, you will need to determine how to filter your data.
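As a quick sketch of the comma-counting trick:

```python
# Count commas by comparing lengths before and after removing them
line = "John,28,Finance"
comma_count = len(line) - len(line.replace(",", ""))
print(comma_count)  # 2
```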
How To Filter If Field Does Not Contain Data?
Sometimes, there are fields that must contain data in your CSV file, and if there doesn’t happen to be anything, then one of your rules may be to exclude that dictionary from being included in your resulting list.
If there are required fields in your CSV file, you can either use the methods outlined above to identify those rows and exclude them from your `clean` variable, or you can instead operate in the for loop over the `DictReader`.
For example, let's assume with our current working example that Age is a required field.
Here’s our working example with Jane not disclosing her age:
```
Name,Age,Dept
John,28,Finance
Jane,,IT
```
Here's the resulting code showing how filtering can be applied after the data has been consumed by `DictReader` – the reason I apply the filter after `DictReader` is that it is much easier to fetch the data this way (rather than using regular expressions or other techniques).
```python
import csv

csv_path = 'emp.csv'
result = []

with open(csv_path, 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Only keep rows where the Age field contains data
        if row['Age']:
            result.append(row)
```
Instead of creating your list using the asterisk operator, now you can filter by looping through each row and applying techniques to check the data before it is added to your result list.
Replace Header Row With New Column Names
What if you wanted to replace the first row headers with your own headers? Is there an easy way to do this?
Yes, you can! Simply use the `fieldnames` parameter to pass a sequence of strings representing the new field names.
For example, suppose you have the following CSV data:
```
Name,Age,Dept
John,28,Finance
Jane,27,IT
```
If you wanted to be more precise with the field names and replace Name with First Name and Dept with Department, you could specify this in the `DictReader` parameter `fieldnames`:
```python
import csv

csv_path = 'emp.csv'
result = []

with open(csv_path, 'r') as f:
    new_headers = ('First Name', 'Age', 'Department')
    # Skip the original header row, otherwise it would be read as data
    next(f)
    reader = csv.DictReader(f, fieldnames=new_headers)
    result = [*reader]
```
Advanced Header Row Manipulation
In a recent CSV data import I found I had to modify the keys by simply removing excess whitespace within the header titles.
For example, suppose the data set was modified to something like this:
```
First  Name,Last Name , Age,Dept.
John,Smith,28,Finance
Jane,Doe,27,IT
```
The problems with the header row in the CSV data above are:

- Extra spacing between words (`First  Name` has too many spaces, which can lead to inconsistency when importing multiple files)
- Spacing at the end of a header field (`Last Name ` contains extra spacing at the end)
- Spacing at the beginning of a header title (` Age` contains spacing before it)
- Unnecessary characters in the header field (`Dept.`, being an abbreviation of Department, contains a full stop)
So you might find it necessary to use the `fieldnames` parameter instead of the default way `csv.DictReader` handles the header row.
This means you need to capture the header row, loop through each field to perform your cleansing, and then output the result into a sequence data type for `fieldnames` to operate on.
Here’s an example of how I recently achieved this result with the “dirty headers” I had to deal with:
```python
import csv
import re

csv_path = 'emp.csv'
result = []

def cleanse(s: str):
    s = re.sub(r'[^A-Za-z0-9_\s]', '', s)
    s = re.sub(r'\s+', ' ', s)
    return s.strip()

with open(csv_path, 'r') as f:
    # readline() consumes only the header, leaving the data rows for DictReader
    dirty_header = f.readline()
    dirty_header_list = dirty_header.split(",")
    new_headers = *(cleanse(x) for x in dirty_header_list),
    reader = csv.DictReader(f, fieldnames=new_headers)
    result = [*reader]
```
Breaking down the code, here’s what the new sections mean:
```python
def cleanse(s: str):
    s = re.sub(r'[^A-Za-z0-9_\s]', '', s)
    s = re.sub(r'\s+', ' ', s)
    return s.strip()
```
This function is what I used to pass each header element through. This code simply takes a single string parameter and removes all non-alphanumeric characters (except the underscore and space character).
The next line removes any duplicate spaces, with the return line stripping any unwanted leading and trailing spaces.
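To see the cleansing in action, here is the function (reproduced with its import) applied to each of the dirty header fields from the example above:

```python
import re

def cleanse(s: str):
    s = re.sub(r'[^A-Za-z0-9_\s]', '', s)  # strip punctuation
    s = re.sub(r'\s+', ' ', s)             # collapse repeated whitespace
    return s.strip()                       # trim leading/trailing spaces

print(cleanse('First  Name'))  # First Name
print(cleanse('Last Name '))   # Last Name
print(cleanse(' Age'))         # Age
print(cleanse('Dept.'))        # Dept
```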
```python
dirty_header = f.readline()
dirty_header_list = dirty_header.split(",")
```

Inside the `with` statement I read the header row with `readline()` (which reads just the first line, leaving the file pointer at the start of the data rows), and then split this row into individual elements using the string `split` method with the comma as the dividing character.
```python
new_headers = *(cleanse(x) for x in dirty_header_list),
```

As I would like to get everything into a tuple, I use the asterisk operator with a trailing comma to unpack the generator expression into a tuple. Kudos to SO for this tip.

The rest of the code then performs the usual function of reading the CSV file into the `result` list, and as you can see, the outcome is a list of dictionaries with nice clean keys.
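If the starred-expression trick is unfamiliar, here's a minimal sketch:

```python
# A trailing comma after a starred expression packs the
# unpacked items into a tuple
gen = (x.strip() for x in [' a ', ' b '])
result = *gen,
print(result)  # ('a', 'b')
```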
CSV File To List Of Dictionaries Template
Using the `csv.DictReader()` class, I structure CSV data in Python in the following way – a guide or template for any new projects going forward:
```python
import csv
import re

# The list where the dictionaries will be stored
result = list()

with open(file_path) as f:
    # Do I need to cleanse the dictionary keys?
    # Yes, but I will manually write the headers...
    my_headers = ('Header 1', 'Header 2')
    # Yes, but I just want to clean the current headers...
    lines = f.readlines()
    # Write the line of the file where the header can be found,
    # in index format (i.e. first line is 0, second line is 1, etc)...
    header_line_idx = 0
    # OR, loop through the lines to find the one that
    # matches a condition identifying the header row...
    header_line_idx = [idx for idx, x in enumerate(lines) if re.match(r'^Header', x)][0]
    header_row = lines[header_line_idx]
    # Create a list of all the elements in the header...
    header_list = header_row.split(",")
    # Create a cleansing function and iterate through each element
    # to clean it to the style you want
    my_headers = *(cleanse(x) for x in header_list),
    # Capturing the CSV file contents into a list of dictionaries
    # (use the remaining lines, as readlines() has exhausted f)
    reader = csv.DictReader(lines[header_line_idx + 1:], fieldnames=my_headers)
    # Do I need to cleanse/check each dictionary or add more
    # keys with values?
    # No:
    result = [*reader]
    # Yes:
    # Loop through each row and do your value cleansing...
    for row in reader:
        # Do all the cleansing work or extra keys work in here...
        # ...
        # With the last line in this for loop being...
        result.append(row)
```
This template is the basis of how I will structure my code when importing CSV data into my Python scripts.
If you have any feedback or suggestions, please check out my gist here.