How to Read a Large CSV File In Pandas Python – Definitive Guide

Sometimes the data in a CSV file is huge, and memory errors can occur while reading it.

You can read a large CSV file in Pandas Python using the read_csv() method with the chunksize parameter. Alternatively, you can use the Dask library to leverage parallel and distributed computing.

This tutorial teaches you how to efficiently read a large CSV file in Pandas and avoid program crashes or memory errors.

The sample CSV file used in this tutorial contains 10 million rows and is about 20 GB in size.

Using read_csv() and chunksize (Example)

The read_csv() method reads a CSV file from disk. However, calling it without the chunksize parameter can cause a memory error when the file is enormous.

Reading in chunks helps avoid memory problems, but concatenating the chunked dataframes into a single one takes extra time.

To read a large CSV file in chunks using pandas read_csv(),

  • Use the chunksize parameter and specify the chunk size.
  • It’ll return a TextFileReader object for iteration.
  • Use the concat() method to concatenate the chunks into a single dataframe.

Use this method when you only need a sample of the data, or when you can process it one chunk at a time, for data analysis (see the sketch below). To read and use the entire dataset, use the Dask library (explained in the next section).
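If you only need an aggregate rather than the full dataframe, you can skip the expensive concatenation and process each chunk as it is read. Here is a minimal sketch of that pattern; it assumes the same huge.csv file and its numeric C1 column shown later in this tutorial.

import pandas as pd

# Accumulate a running sum and row count one chunk at a time, so only
# one chunk (2000 rows here) is held in memory at any moment.
total = 0
rows = 0
for chunk in pd.read_csv("huge.csv", chunksize=2000):
    total += chunk["C1"].sum()
    rows += len(chunk)

print("Mean of C1:", total / rows)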

Code

The following code demonstrates how to

  • Use the chunksize parameter to read large CSV files in chunks.
  • Concatenate the chunks into a single dataframe.
  • While concatenating, pass ignore_index=True, because the chunks carry duplicate indexes.
import pandas as pd
import time

# read_csv() with chunksize returns a lazy TextFileReader iterator,
# so this step only creates the reader; the file is not read yet.
start = time.time()
chunked_dfs = pd.read_csv("huge.csv", chunksize=2000)
end = time.time()

print("Read csv with chunks: ", (end - start), "sec")

# Concatenating iterates over every chunk, so the actual reading from
# disk happens here. ignore_index=True avoids duplicate row indexes.
start = time.time()
df = pd.concat(chunked_dfs, ignore_index=True)
end = time.time()

print("Time taken to concatenate chunks into a single dataframe ", (end - start), "sec")

Output

    Read csv with chunks:  0.006186008453369141 sec
    Time taken to concatenate chunks into a single dataframe  9.055688858032227 sec

DataFrame Will Look Like

df

Sample rows are printed. The dataframe has the columns Unnamed: 0, C1, C2, C3, C4, and Target.

10000000 rows × 6 columns

Using Dask DataFrame

The Dask library in Python supports parallel computing and extends standard libraries such as Pandas and NumPy for data manipulation.

Install Dask

To install Dask, run the following command. Prefix it with % if you’re installing from a Jupyter notebook.

%pip install dask

Dask is installed.

Reading a Large CSV File Using Dask

To read a huge CSV file using the Dask library,

  • Import the Dask dataframe module.
  • Use the read_csv() method to read the file. Dask reads the file lazily and in parallel, so the call returns almost immediately; the actual reading happens when you request results, for example with head() or compute() (see the sketch after the sample output below).
  • It returns a single Dask dataframe, so there is no need to concatenate chunked dataframes.
  • It takes less time than the chunked reading and concatenation supported by Pandas read_csv(), even for a file with millions of rows.

Of the two approaches covered in this tutorial, this is the faster way to read a large CSV file in Python.

Use this method when you want to read the complete data from the large CSV file and use it for data analysis purposes.

Code

The following code demonstrates how to use the read_csv() method from the Dask library.

import dask.dataframe as da
import time

# Dask builds a lazy, partitioned dataframe instead of loading the whole
# file into memory, so this call returns almost immediately.
start = time.time()
df = da.read_csv("huge.csv")
end = time.time()
print("Read csv with dask: ", (end - start), "sec")

# head() triggers computation on the first partition only.
df.head()

Output

    Read csv with dask:  0.020373821258544922 sec

DataFrame Will Look Like

Sample rows are printed. The Dask dataframe has the same columns: Unnamed: 0, C1, C2, C3, C4, and Target.
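Because Dask evaluates lazily, an aggregation over the full file only runs when you call compute(). The following is a minimal sketch of that step; it assumes the same huge.csv file and its numeric C1 column from the sample above.

import dask.dataframe as da

# Define the read and the aggregation lazily; nothing is executed yet.
df = da.read_csv("huge.csv")
mean_c1 = df["C1"].mean()

# compute() triggers the parallel read and the reduction across all partitions.
print(mean_c1.compute())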

Dask Vs Pandas

  • Pandas uses a single CPU core, while Dask uses multiple CPU cores to process the data.
  • Hence, Dask taps the full potential of a single machine with multiple cores.
  • Dask also provides arrays that can substitute for NumPy arrays, as in the sketch below.
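As a quick illustration of the last point, here is a minimal sketch that builds a chunked Dask array and computes its mean in parallel; the array shape and chunk size are arbitrary values chosen for the example.

import dask.array as darr

# A 10000 x 10000 random array split into 1000 x 1000 chunks that Dask
# can process in parallel across CPU cores.
x = darr.random.random((10000, 10000), chunks=(1000, 1000))

# mean() is lazy; compute() runs the chunked reduction in parallel.
print(x.mean().compute())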
