Sklearn datasets come in handy for learning machine learning concepts. When using them, you may need to convert them to a pandas dataframe to manipulate and clean the data.
You can convert a sklearn dataset to a pandas dataframe by using the pd.DataFrame(data=iris.data) constructor.
In this tutorial, you’ll learn how to convert sklearn datasets into pandas dataframes.
If You’re in a Hurry…
You can use the code snippet below to convert the sklearn dataset to a pandas dataframe.
Snippet
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target"] = iris.target
df.head()
When you print the dataframe using the df.head() method, you’ll see the pandas dataframe created from the sklearn iris dataset.
Dataframe Will Look Like
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
This is how you can convert the sklearn dataset to a pandas dataframe.
If You Want to Understand Details, Read on…
In the sections below, you’ll see in detail how to convert sklearn datasets to pandas dataframes when using them to create machine learning models.
Sklearn Datasets
Sklearn datasets are datasets that are readily available to you for practicing machine learning. With them, you do not need to download data as a CSV file to your local machine. You can load the dataset objects directly from the sklearn library.
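To get a feel for what such a dataset object contains before converting it, you can inspect a few of its attributes. The following is a minimal sketch using the iris dataset; the printed values in the comments are what the iris loader is expected to return.

```python
from sklearn import datasets

# load_iris() returns a Bunch, a dictionary-like object whose keys
# can also be read as attributes.
iris = datasets.load_iris()

print(type(iris).__name__)      # Bunch
print(iris.data.shape)          # (150, 4): 150 samples, 4 features
print(iris.target.shape)        # (150,): one class label per sample
print(iris.feature_names)       # list of the four feature name strings
print(list(iris.target_names))  # the three class names
```

The data and target attributes are NumPy arrays, which is why they can be passed straight to pandas.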
A pandas dataframe is a two-dimensional data structure that stores data in rows and columns. It provides many data manipulation functions that are useful for feature engineering.
You can use the sections below to convert sklearn datasets to dataframes as per your need.
Converting Sklearn Datasets To Dataframe Without Column Names
In this section, you’ll convert the sklearn datasets to dataframes without column names.
You can use this when you only need the dataframe for a quick visualization and the feature names do not matter.
The columns will be named with the default integer labels 0, 1, 2, and so on (0 to 3 for the four iris features).
Snippet
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data)
df["target"] = iris.target
df.head()
Dataframe Will Look Like
| | 0 | 1 | 2 | 3 | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
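One detail worth noticing with default labels is that the feature columns are labelled with integers, while the target column is labelled with a string. A small sketch of how you would then select a column:

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data)
df["target"] = iris.target

# The feature columns have integer labels; "target" is a string label.
print(df.columns.tolist())  # [0, 1, 2, 3, 'target']

# Integer labels are written without quotes when selecting a column.
first_feature = df[0]
print(first_feature.head())
```

Mixed label types like this are easy to trip over, which is one reason to prefer named columns.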
Next, you’ll learn about the column names.
With Column Names
Column names in a pandas dataframe are very useful for identifying the columns/features in the dataframe. In this section, you’ll learn how to convert the sklearn dataset to a dataframe with column names.
Converting Sklearn Datasets To Dataframe Using Feature Names As Columns
Sklearn provides the names of the features in the attribute feature_names. You can pass this attribute to the columns parameter of the pd.DataFrame() constructor to create the dataframe with column headers.
Sklearn also provides the target variable for the samples in the attribute target. For a classification dataset such as iris, these are the class labels. You can use target to fetch the target values and add them to your dataframe as a new column.
Snippet
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target"] = iris.target
df.head()
When you print the dataframe with df.head(), you’ll see the dataframe with the column headers.
Dataframe Will Look Like
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
This is how you can convert the sklearn dataset to a pandas dataframe with column headers by using the sklearn dataset’s feature_names attribute.
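The target column above holds integer class codes. If you would rather see the class names, a possible sketch is to index the dataset’s target_names array with those codes. The column name "species" is an assumption chosen for illustration; it is not something sklearn provides.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target"] = iris.target

# target_names is a NumPy array of class names; indexing it with the
# integer targets yields one name per row.
# "species" is a hypothetical column name used for this example.
df["species"] = iris.target_names[iris.target]

# Show each distinct (target, species) pair once.
print(df[["target", "species"]].drop_duplicates())
```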
Later, if you want to rename the features, you can also rename the dataframe columns.
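As a rough sketch of that renaming step, you can pass a mapping of old names to new names to the dataframe’s rename() method. The new names below are only examples.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Map old labels to new ones; columns not listed keep their names.
df = df.rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
})

print(df.columns.tolist())
```

Only the two sepal columns are renamed here; the petal columns keep their original labels.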
Using Custom Column Headers
In some cases, you may need to use custom headers as columns rather than using the sklearn dataset’s feature_names attribute.
You can do this by passing a list of column headers to the columns parameter of the pd.DataFrame() constructor.
For example, in the snippet below, the column headers use only the feature names and leave out the unit of the data (cm). Here, the unit doesn’t make a big difference.
Snippet
import pandas as pd
from sklearn import datasets
# Load the IRIS dataset
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
df["target"] = iris.target
df.head()
When you print the data, you’ll see the dataframe with the custom headers you used while creating the dataframe.
Dataframe Will Look Like
| | sepal_length | sepal_width | petal_length | petal_width | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
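If you don’t want to type the custom headers by hand, one option is to derive them from feature_names. This is only a sketch that assumes every feature name ends with the " (cm)" unit, which holds for iris but not for every dataset; clean_names is a name made up for this example.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Drop the " (cm)" unit suffix and replace spaces with underscores.
clean_names = [
    name.replace(" (cm)", "").replace(" ", "_")
    for name in iris.feature_names
]
print(clean_names)
# ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

df = pd.DataFrame(data=iris.data, columns=clean_names)
df["target"] = iris.target
```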
Converting Only Specific Columns from Sklearn Dataset
In some scenarios, you may not need all the columns of the sklearn dataset in the pandas dataframe.
In that case, you need to create a pandas dataframe with only specific columns from the sklearn dataset.
The loader returns a Bunch object whose data attribute is a NumPy array without column labels, so you cannot pick a column from it by name. A simple approach is to convert the entire dataset to a dataframe first, and then either drop the unnecessary columns or select only the columns you need into another dataframe.
Snippet
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
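# Use custom column names so the columns can be selected by these names
df = pd.DataFrame(data=iris.data, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
# Keep only the required columns; copy() gives an independent dataframe,
# so adding a column later does not trigger a SettingWithCopyWarning
df = df[["sepal_length", "petal_length"]].copy()
df["target"] = iris.target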
df.head()
When you print the dataframe, you’ll see a dataframe with only the columns you have selected.
Dataframe Will Look Like
| | sepal_length | petal_length | target |
|---|---|---|---|
| 0 | 5.1 | 1.4 | 0 |
| 1 | 4.9 | 1.4 | 0 |
| 2 | 4.7 | 1.3 | 0 |
| 3 | 4.6 | 1.5 | 0 |
| 4 | 5.0 | 1.4 | 0 |
This is how you can convert only specific columns from the sklearn datasets to pandas dataframe.
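As an alternative sketch, and assuming you are on scikit-learn 0.23 or later, the loader itself can return pandas objects through the as_frame=True argument. You can then select the columns by their original feature names.

```python
from sklearn import datasets

# as_frame=True asks the loader to return pandas objects
# (assumes scikit-learn 0.23 or later).
iris = datasets.load_iris(as_frame=True)

# iris.frame holds the feature columns plus a "target" column.
df = iris.frame[["sepal length (cm)", "petal length (cm)", "target"]]
print(df.head())
```

With this route the columns keep the names from feature_names, so you would still rename them if you prefer headers without the unit.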
Conclusion
To summarize, you’ve learned how to convert a sklearn dataset to a pandas dataframe. The same approach applies to the other sklearn datasets, such as:
- Boston house prices dataset (removed from scikit-learn in version 1.2)
- Iris plants dataset
- Diabetes dataset
- Linnerrud dataset
- Wine recognition dataset
- Breast cancer dataset
- The Olivetti faces dataset
- California Housing dataset
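To illustrate reusing the same steps for other datasets, here is a minimal sketch. The function name bunch_to_dataframe is hypothetical and not part of sklearn. It assumes the dataset object has data, feature_names and a single-column target, which is the case for the wine and diabetes loaders used below but not for every dataset in the list (for example, Linnerud has several target columns).

```python
import pandas as pd
from sklearn import datasets


def bunch_to_dataframe(bunch):
    """Hypothetical helper: build a dataframe from a sklearn Bunch.

    Assumes the Bunch exposes data, feature_names and a 1-D target.
    """
    df = pd.DataFrame(data=bunch.data, columns=bunch.feature_names)
    df["target"] = bunch.target
    return df


wine_df = bunch_to_dataframe(datasets.load_wine())
diabetes_df = bunch_to_dataframe(datasets.load_diabetes())

print(wine_df.shape)      # (178, 14): 13 features plus target
print(diabetes_df.shape)  # (442, 11): 10 features plus target
```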
If you’ve any questions, comment below.