Every file is stored with a specific character encoding that determines how its bytes map to characters.
You can fix the UnicodeDecodeError: 'utf-8' codec can't decode byte error by detecting the file's actual encoding and passing that charset to the read_csv() method.
Syntax
import pandas as pd
df = pd.read_csv('Oscars.csv', encoding='ISO-8859-1')
df
This tutorial explains how to reproduce the UnicodeDecodeError, how to detect a file's encoding using the chardet library, and how to pass that encoding while reading the CSV file.
Reason For Error
The read_csv() method uses the UTF-8 encoding by default while reading CSV files. The UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position XX: invalid continuation byte occurs when the CSV file contains a byte sequence that is not valid UTF-8.
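To make the cause concrete, here is a minimal standard-library sketch (the byte values are illustrative, not taken from Oscars.csv). In UTF-8, 0xcc opens a two-byte sequence, so the byte that follows must be a continuation byte in the range 0x80–0xbf:

```python
# 0xcc starts a two-byte UTF-8 sequence; 0x41 ('A') is not a valid
# continuation byte, so UTF-8 decoding fails.
data = b'\xcc\x41'

try:
    data.decode('utf-8')
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0xcc in position 0: invalid continuation byte

# ISO-8859-1 maps every single byte to a character, so it always decodes.
print(data.decode('ISO-8859-1'))  # ÌA
```

This is also why ISO-8859-1 never raises a decode error: every possible byte is a valid character in that charset.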
Read CSV Without Encoding
The Oscars.csv file contains special characters encoded with ISO-8859-1.

- When you read it with read_csv() without passing the charset explicitly, it throws the UnicodeDecodeError.
- This happens because the read_csv() method uses UTF-8 encoding by default.
Code
import pandas as pd
df = pd.read_csv('Oscars.csv')
df
Error
UnicodeDecodeError Traceback (most recent call last)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 10443: invalid continuation byte
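If you cannot inspect files up front, one defensive pattern is to try UTF-8 first and fall back to ISO-8859-1. This is a sketch, not part of the original tutorial; the function name and encoding list are my own choices:

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=('utf-8', 'ISO-8859-1')):
    """Try each encoding in order; return the first DataFrame that decodes."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue  # this charset could not decode the file; try the next one
    raise ValueError(f'none of {encodings} could decode {path}')
```

Because ISO-8859-1 decodes any byte sequence, the fallback always succeeds, but the text may be garbled if the file's real encoding was something else entirely. Detecting the encoding, as shown next, is the more reliable approach.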
Detect File Encoding
Many different charsets exist, so you need to find out which one your file actually uses.
To detect the proper charset of your file, use the chardet library, a universal character-encoding detector.

- Open the file in binary read mode.
- Use the chardet.detect() method to detect the file's charset. Pass a larger number of bytes to detect(); the more bytes it sees, the more accurate the detected charset is likely to be.
- Print the result to see the detected charset and a confidence score.
Code
import chardet
with open('Oscars.csv', 'rb') as filedata:
    result = chardet.detect(filedata.read(100000))
result
Output
{'encoding': 'ISO-8859-1', 'confidence': 0.7289274470020289, 'language': ''}
The current CSV file uses the ISO-8859-1 encoding, detected with roughly 73% confidence.
Read CSV With Encoding
You have detected the charset of the CSV file. While reading the CSV file, pass that charset through the encoding parameter.
The file will be read and the dataframe will be created without the UnicodeDecodeError.
Code
The following code demonstrates how to use the encoding while reading the CSV file.
import pandas as pd
df = pd.read_csv('Oscars.csv', encoding='ISO-8859-1')
df
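If you only need to load the data and can tolerate a few mangled characters, newer pandas versions (1.3 or later, an assumption about your install) also accept an encoding_errors parameter, so you can skip detection entirely. A self-contained sketch using a tiny stand-in file instead of Oscars.csv:

```python
import pandas as pd

# A tiny stand-in CSV: the lone 0xcc byte is not valid UTF-8.
with open('sample.csv', 'wb') as f:
    f.write(b'person,movie\nMil\xccestone,Two Arabian Knights\n')

# encoding_errors='replace' swaps undecodable bytes for U+FFFD (�)
# instead of raising UnicodeDecodeError.
df = pd.read_csv('sample.csv', encoding_errors='replace')
print(df.loc[0, 'person'])  # Mil�estone
```

Prefer passing the correct encoding when you know it, since each replaced byte loses information.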
DataFrame Will Look Like
 | _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | birthplace | birthplace:confidence | date_of_birth | date_of_birth:confidence | race_ethnicity | … | award | biourl | birthplace_gold | date_of_birth_gold | movie | person | race_ethnicity_gold | religion_gold | sexual_orientation_gold | year_of_award_gold |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 670454353 | False | finalized | 3 | 2/10/15 3:45 | Chisinau, Moldova | 1.0 | 30-Sep-1895 | 1.0 | White | … | Best Director | http://www.nndb.com/people/320/000043191/ | NaN | NaN | Two Arabian Knights | Lewis Milestone | NaN | NaN | NaN | NaN |
1 | 670454354 | False | finalized | 3 | 2/10/15 2:03 | Glasgow, Scotland | 1.0 | 2-Feb-1886 | 1.0 | White | … | Best Director | http://www.nndb.com/people/626/000042500/ | NaN | NaN | The Divine Lady | Frank Lloyd | NaN | NaN | NaN | NaN |
2 | 670454355 | False | finalized | 3 | 2/10/15 2:05 | Chisinau, Moldova | 1.0 | 30-Sep-1895 | 1.0 | White | … | Best Director | http://www.nndb.com/people/320/000043191/ | NaN | NaN | All Quiet on the Western Front | Lewis Milestone | NaN | NaN | NaN | NaN |
3 | 670454356 | False | finalized | 3 | 2/10/15 2:04 | Chicago, Il | 1.0 | 23-Feb-1899 | 1.0 | White | … | Best Director | http://www.nndb.com/people/544/000041421/ | NaN | NaN | Skippy | Norman Taurog | NaN | NaN | NaN | NaN |
4 | 670454357 | False | finalized | 3 | 2/10/15 1:48 | Salt Lake City, Ut | 1.0 | 23-Apr-1894 | 1.0 | White | … | Best Director | http://www.nndb.com/people/292/000044160/ | NaN | NaN | Bad Girl | Frank Borzage | NaN | NaN | NaN | NaN |
441 rows × 27 columns