How To Solve Unicode Decode Error While Reading CSV File in Pandas – With Examples

Every text file is stored in a specific encoding, which defines how its bytes map to characters.

You can solve the UnicodeDecodeError: 'utf-8' codec can't decode byte error by detecting the proper encoding of the file and passing it to the read_csv() method through the encoding parameter.

Syntax

import pandas as pd

df = pd.read_csv('Oscars.csv', encoding='ISO-8859-1')

df

This tutorial explains how to reproduce the UnicodeDecodeError, how to detect the proper encoding of the file using the chardet library and how to use it while reading the CSV file.

Reason For Error

The read_csv() method uses UTF-8 encoding by default when reading CSV files. The UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position XX: invalid continuation byte occurs when the CSV file was saved in a different encoding and therefore contains byte sequences that are not valid UTF-8.
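The same failure can be reproduced without pandas. The sketch below is only an illustration: the sample text is made up and is not taken from Oscars.csv. It encodes a short string as ISO-8859-1 and then tries to decode the resulting bytes as UTF-8.

```python
# Illustrative sample text; not taken from Oscars.csv.
text = "Ìa"

# In ISO-8859-1 every character is a single byte, so "Ì" becomes 0xcc.
raw = text.encode("ISO-8859-1")

try:
    raw.decode("utf-8")
    error_reason = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xcc starts a two-byte sequence, but the next byte ("a")
    # is not a valid continuation byte, so decoding fails.
    error_reason = exc.reason
    print(exc)

# Decoding with the encoding the bytes were written in succeeds.
decoded = raw.decode("ISO-8859-1")
print(decoded)
```

The bytes themselves are fine; only the assumed encoding is wrong, which is why naming the right charset fixes the error.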

Read CSV Without Encoding

The Oscars.csv file contains special characters encoded with the ISO-8859-1 charset.

  • Reading it with the read_csv() method without passing the charset explicitly throws the UnicodeDecodeError.
  • This happens because the read_csv() method uses UTF-8 encoding by default.

Code

import pandas as pd

df = pd.read_csv('Oscars.csv')

df

Error

    UnicodeDecodeError                        Traceback (most recent call last)

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 10443: invalid continuation byte.
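If you do not have Oscars.csv at hand, the error can be reproduced with a small stand-in file. The following is a minimal sketch: the file name sample.csv and its two rows are made up for illustration. It writes the file as ISO-8859-1 and then reads it without an encoding argument.

```python
import os
import tempfile

import pandas as pd

# Made-up sample rows; "é" and "á" are single bytes in ISO-8859-1.
sample = "name,city\nJosé,Málaga\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    with open(csv_path, "w", encoding="ISO-8859-1", newline="") as f:
        f.write(sample)

    try:
        pd.read_csv(csv_path)  # no encoding argument, so UTF-8 is assumed
        decode_failed = False
    except UnicodeDecodeError as exc:
        decode_failed = True
        print(exc)
```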

Detect File Encoding

Many different charsets exist.

To detect the charset of your file, use the chardet library, a universal character encoding detector.

  • Open the file in binary read mode ('rb').
  • Use the chardet.detect() method to detect the file’s charset. Pass a larger number of bytes to detect() so that the detected charset is more likely to be accurate.
  • Print the result to see the charset of your file.

Code

import chardet

with open('Oscars.csv', 'rb') as filedata:
    result = chardet.detect(filedata.read(100000))
result

Output

    {'encoding': 'ISO-8859-1', 'confidence': 0.7289274470020289, 'language': ''}

chardet reports ISO-8859-1 with a confidence of about 0.73, so the CSV is most likely encoded in ISO-8859-1.
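Because the result is a guess with a confidence score, you may want to check that score before using the encoding. The sketch below shows one possible way to do so. The helper name pick_encoding, the 0.5 threshold and the UTF-8 fallback are all assumptions made for illustration and are not part of chardet.

```python
def pick_encoding(result, min_confidence=0.5, fallback="utf-8"):
    """Return the detected encoding, or `fallback` if the guess looks weak.

    `result` is a dict shaped like the one chardet.detect() returns.
    The 0.5 threshold and the UTF-8 fallback are arbitrary assumptions.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# The detection result shown in the output above.
detected = {"encoding": "ISO-8859-1", "confidence": 0.7289274470020289, "language": ""}
chosen = pick_encoding(detected)
print(chosen)  # ISO-8859-1
```

With a stricter threshold, for example min_confidence=0.9, the same result would fall back to UTF-8, so choose the threshold to suit how much you trust the detection.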

Read CSV With Encoding

You have detected the charset of the CSV file.

While reading the CSV file, pass the charset using the encoding parameter.

The file will be read, and the dataframe will be created without the UnicodeDecodeError.

Code

The following code demonstrates how to use the encoding while reading the CSV file.

import pandas as pd

df = pd.read_csv('Oscars.csv', encoding='ISO-8859-1')

df

DataFrame Will Look Like

|   | _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | birthplace | birthplace:confidence | date_of_birth | date_of_birth:confidence | race_ethnicity | ... | award | biourl | birthplace_gold | date_of_birth_gold | movie | person | race_ethnicity_gold | religion_gold | sexual_orientation_gold | year_of_award_gold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 670454353 | False | finalized | 3 | 2/10/15 3:45 | Chisinau, Moldova | 1.0 | 30-Sep-1895 | 1.0 | White | ... | Best Director | http://www.nndb.com/people/320/000043191/ | NaN | NaN | Two Arabian Knights | Lewis Milestone | NaN | NaN | NaN | NaN |
| 1 | 670454354 | False | finalized | 3 | 2/10/15 2:03 | Glasgow, Scotland | 1.0 | 2-Feb-1886 | 1.0 | White | ... | Best Director | http://www.nndb.com/people/626/000042500/ | NaN | NaN | The Divine Lady | Frank Lloyd | NaN | NaN | NaN | NaN |
| 2 | 670454355 | False | finalized | 3 | 2/10/15 2:05 | Chisinau, Moldova | 1.0 | 30-Sep-1895 | 1.0 | White | ... | Best Director | http://www.nndb.com/people/320/000043191/ | NaN | NaN | All Quiet on the Western Front | Lewis Milestone | NaN | NaN | NaN | NaN |
| 3 | 670454356 | False | finalized | 3 | 2/10/15 2:04 | Chicago, Il | 1.0 | 23-Feb-1899 | 1.0 | White | ... | Best Director | http://www.nndb.com/people/544/000041421/ | NaN | NaN | Skippy | Norman Taurog | NaN | NaN | NaN | NaN |
| 4 | 670454357 | False | finalized | 3 | 2/10/15 1:48 | Salt Lake City, Ut | 1.0 | 23-Apr-1894 | 1.0 | White | ... | Best Director | http://www.nndb.com/people/292/000044160/ | NaN | NaN | Bad Girl | Frank Borzage | NaN | NaN | NaN | NaN |

441 rows × 27 columns
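When chardet is not available, or its confidence is low, one alternative is to try a short list of candidate encodings in order. The sketch below is a hypothetical helper, not a pandas feature: the name read_csv_with_fallback, the candidate list and the sample data are all assumptions. Note that ISO-8859-1 maps every possible byte to a character, so it never raises a UnicodeDecodeError. Keep it last in the list and inspect the text afterwards for garbled characters.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Try each candidate encoding in order; return (dataframe, encoding used).

    Hypothetical helper for illustration. The candidate list is an assumption;
    adjust it to the encodings you expect to encounter.
    """
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as exc:
            last_error = exc
    raise ValueError(f"None of {encodings} could decode {path}") from last_error


# Made-up sample data written as ISO-8859-1 so that UTF-8 decoding fails.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    with open(csv_path, "w", encoding="ISO-8859-1", newline="") as f:
        f.write("name,city\nJosé,Málaga\n")

    df, used_encoding = read_csv_with_fallback(csv_path)

print(used_encoding)
print(df)
```

Returning the encoding along with the dataframe lets you log which charset was actually used for each file.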
