Visualizations play an essential role in the exploratory data analysis activity of machine learning.
You can compute a confusion matrix using the confusion_matrix() function from the sklearn.metrics package and plot it as a heatmap using the seaborn library.
Why Confusion Matrix?
After creating a machine learning model, accuracy is a common metric for evaluating it. However, accuracy can be misleading in some cases. An accuracy of 99% may look good as a percentage, but consider a machine learning model used for fraud detection or drug consumption detection.
In such critical scenarios, the 1% of failures can have a significant impact.
For example, if a model predicts a fraudulent transaction of $10,000 as Not Fraud, it is not a good model and cannot be used in production.
In the drug consumption model, suppose the model predicts that a person has consumed the drug when they actually have not. Due to this false prediction, the person may be imprisoned for a crime they did not commit.
In such scenarios, you need a better metric than accuracy to validate the machine learning model.
This is where the confusion matrix comes into the picture.
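To make the problem with accuracy concrete, here is a minimal sketch in plain Python. The data is hypothetical (1,000 transactions, 10 of them fraudulent), and the "model" simply predicts Not Fraud every time; neither comes from a real dataset.

```python
# Hypothetical data: 1,000 transactions, of which 10 are fraudulent (label 1).
actual = [1] * 10 + [0] * 990

# A useless "model" that predicts Not Fraud (0) for every transaction.
predicted = [0] * 1000

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Fraud cases the model caught versus fraud cases it missed.
caught_fraud = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
missed_fraud = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

print(f"Accuracy: {accuracy:.1%}")
print(f"Fraud caught: {caught_fraud}, fraud missed: {missed_fraud}")
```

The accuracy comes out at 99.0% even though every fraudulent transaction is missed, which is exactly the kind of failure a confusion matrix exposes.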
In this tutorial, you’ll learn what a confusion matrix is and how to plot a confusion matrix for a binary classification model and a multiclass classification model.
What is a Confusion Matrix?
A confusion matrix is a table that lets you visualize the performance of a classification model by comparing the actual classes with the predicted classes. Each cell counts how many samples of one actual class were predicted as a given class, so you can see not only how often the model is wrong but also which kinds of mistakes it makes.
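As a rough illustration of what the matrix contains, the sketch below builds one by hand in plain Python, without sklearn. The two label lists are made up for this example.

```python
from collections import Counter

# Hypothetical actual and predicted classes for eight samples
# (0 = negative, 1 = positive).
actual = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]

# Count how often each (actual, predicted) pair occurs.
pair_counts = Counter(zip(actual, predicted))

classes = sorted(set(actual) | set(predicted))
# Rows are actual classes, columns are predicted classes.
matrix = [[pair_counts[(a, p)] for p in classes] for a in classes]

for row in matrix:
    print(row)
```

The diagonal cells hold the correct predictions, and every off-diagonal cell is one specific kind of mistake.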
Creating Binary Class Classification Model
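In this section, you’ll create a classification model that predicts whether a breast tumor is malignant or benign. In the sklearn dataset, the target value is 0 for malignant and 1 for benign; the plots in this tutorial label these two classes False and True.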
The breast cancer dataset is available in the sklearn dataset library.
It contains 569 rows. Each row includes 30 numeric features and one output class. If you want to manipulate or visualize the sklearn dataset, you can convert it into a pandas dataframe and use the pandas dataframe functionality.
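One possible way to get the dataset as a pandas dataframe is the as_frame option of the loader, sketched below. The variable names breast_cancer and df are chosen for this example only.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns the data as pandas objects instead of NumPy arrays.
breast_cancer = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a "target" column.
df = breast_cancer.frame

print(df.shape)
# Class names in the order of the target values 0 and 1.
print(breast_cancer.target_names)
# Number of rows in each output class.
print(df["target"].value_counts())
```

Printing target_names is a quick way to check which class each target value stands for before labelling a plot.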
To create the model, you’ll load the sklearn dataset, split it into training and test sets, and fit the KNeighborsClassifier model to the training data.
After creating the model, you can use the test data to predict the values and check how the model is performing.
You can use the actual output classes from your test data and the predicted output returned by the predict() method to plot the confusion matrix and evaluate the model accuracy.
Use the below snippet to create the model.
Snippet
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN
breastCancer = load_breast_cancer()
X = breastCancer.data
y = breastCancer.target
# Split the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
knn = KNN(n_neighbors = 3)
# train the model
knn.fit(X_train, y_train)
print('Model is Created')
The KNeighborsClassifier model is created for the breast cancer training data.
Output
Model is Created
To test the model created, you can use the test data obtained from the train test split and predict the output. Then, you’ll have the predicted values.
Snippet
y_pred = knn.predict(X_test)
y_pred
Output
array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1,
0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,
0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1,
0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
0, 1, 0, 0, 1, 1, 0, 1])
Now use the predicted classes and the actual output classes from the test data to visualize the confusion matrix.
You’ll learn how to plot the confusion matrix for the binary classification model in the next section.
Plot Confusion Matrix for Binary Classes
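You can create the confusion matrix using the confusion_matrix() function from the sklearn.metrics package. The confusion_matrix() function returns an array in which the rows represent the actual classes and the columns represent the predicted classes. For binary classes, the first row contains the True Negatives and False Positives, and the second row contains the False Negatives and True Positives.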
Snippet
from sklearn.metrics import confusion_matrix
#Generate the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
print(cf_matrix)
Output
[[ 73 7]
[ 7 141]]
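The four cells can also be unpacked into named variables and turned into the usual metrics. The sketch below copies the matrix printed above by hand so that it runs on its own; the metric variable names are chosen for this example.

```python
import numpy as np

# The confusion matrix printed above, copied by hand so this block runs on its own.
cf_matrix = np.array([[73, 7],
                      [7, 141]])

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = cf_matrix.ravel()

accuracy = (tp + tn) / cf_matrix.sum()
precision = tp / (tp + fp)  # of the samples predicted as class 1, how many are class 1
recall = tp / (tp + fn)     # of the actual class 1 samples, how many were found

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}")
```

Here "positive" means class 1, because confusion_matrix() orders the classes by value and the last class in a binary problem is treated as the positive one.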
Once you have the confusion matrix, you can use the heatmap() function available in the seaborn library to plot it.
The seaborn heatmap() function accepts one mandatory parameter and a few optional parameters.
- data – A rectangular dataset that can be coerced into a 2D array. Here, you can pass the confusion matrix you already have.
- annot=True – Writes the data value in each cell of the plotted matrix. By default, this is False.
- cmap='Blues' – The name of the matplotlib colormap. Here, we’ve created the plot using shades of blue.
The heatmap() function returns the matplotlib axes, which can be stored in a variable. Here, you’ll store it in the variable ax. Now, you can set the title, the x-axis and y-axis labels, and the tick labels for both axes.
- Title – Used to label the complete image. Use the set_title() method to set the title.
- Axes labels – Used to name the x axis or y axis. Use set_xlabel() to set the x-axis label and set_ylabel() to set the y-axis label.
- Tick labels – Used to name the classes along the axes. You can pass the tick labels as a list, and the list must follow the sorted order of the class values, because confusion_matrix() arranges its rows and columns in that order. Use xaxis.set_ticklabels() to set the tick labels for the x-axis and yaxis.set_ticklabels() to set the tick labels for the y-axis.
Finally, use the plt.show() function to display the confusion matrix.
Use the below snippet to plot the confusion matrix, set the title and axis labels, set the tick labels, and display it.
Snippet
import matplotlib.pyplot as plt
import seaborn as sns
## fmt='d' prints whole counts (the default format would show 141 as 1.4e+02)
ax = sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')
ax.set_title('Seaborn Confusion Matrix with labels\n\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ')
## Tick labels - list must follow the sorted order of the class values
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])
## Display the visualization of the Confusion Matrix.
plt.show()
Output


Alternatively, if you want to avoid using seaborn, you can plot the confusion matrix using the ConfusionMatrixDisplay.from_predictions() method available in the sklearn library itself.
Next, you’ll learn how to plot a confusion matrix with percentages.
Plot Confusion Matrix for Binary Classes With Percentage
The objective of creating and plotting the confusion matrix is to check the accuracy of the machine learning model. It often helps to see the results as percentages rather than just raw counts. In this section, you’ll learn how to plot a confusion matrix for binary classes with percentages.
To plot the confusion matrix with percentages, you first need to calculate the percentage of True Positives, False Positives, False Negatives, and True Negatives. You can calculate these percentages by dividing each value by the sum of all values.
Using the np.sum() function, you can sum all values in the confusion matrix.
Then pass the percentage of each value as data to the heatmap() function by using the expression cf_matrix/np.sum(cf_matrix). The fmt='.2%' argument formats each cell as a percentage with two decimal places.
Use the below snippet to plot the confusion matrix with percentages.
Snippet
ax = sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,
fmt='.2%', cmap='Blues')
ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
## Tick labels - list must follow the sorted order of the class values
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])
## Display the visualization of the Confusion Matrix.
plt.show()
Output


Plot Confusion Matrix for Binary Classes With Labels
In this section, you’ll plot a confusion matrix for binary classes with the labels True Positives, False Positives, False Negatives, and True Negatives.
You need to create a list of the labels and convert it into an array of shape (2, 2) using the np.asarray() function followed by reshape(2,2). Then, pass this array of labels to the annot parameter, together with fmt='' so that seaborn does not try to format the strings as numbers. This will plot the confusion matrix annotated with the labels.
Use the below snippet to plot the confusion matrix with labels.
Snippet
labels = ['True Neg','False Pos','False Neg','True Pos']
labels = np.asarray(labels).reshape(2,2)
ax = sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
## Tick labels - list must follow the sorted order of the class values
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])
## Display the visualization of the Confusion Matrix.
plt.show()
Output


Plot Confusion Matrix for Binary Classes With Labels And Percentages
In this section, you’ll learn how to plot a confusion matrix with labels, counts, and percentages.
You can use this to measure the percentage of each label, for example, what percentage of the predictions are True Positives, False Positives, False Negatives, and True Negatives.
For this, you first need to create a list of label names, then a list with the count for each cell, and another list with the percentage for each cell.
Then you can zip these lists to create the annotation labels. Zipping means taking the items at the same position in each list and combining them into a single entry. Then, this list must be converted into an array using the np.asarray() function.
Then pass the final array to the annot parameter. This will create a confusion matrix with the label, count, and percentage information for each cell.
Use the below snippet to visualize the confusion matrix with all the details.
Snippet
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
ax = sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
## Tick labels - list must follow the sorted order of the class values
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])
## Display the visualization of the Confusion Matrix.
plt.show()
Output


This is how you can create a confusion matrix for the binary classification machine learning model.
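The label-building steps above can also be wrapped in a small helper so they are not repeated for every plot. This is only a sketch: make_annotations is a hypothetical function name introduced here, not part of seaborn or sklearn, and the matrix is copied by hand from the earlier output.

```python
import numpy as np

def make_annotations(cf_matrix, group_names=None):
    """Build an array of cell labels: optional name, count, and percentage."""
    cf_matrix = np.asarray(cf_matrix)
    counts = cf_matrix.flatten()
    percentages = counts / counts.sum()
    if group_names is None:
        group_names = [''] * counts.size
    # strip() drops the leading newline when a cell has no name.
    labels = [f"{name}\n{count}\n{pct:.2%}".strip()
              for name, count, pct in zip(group_names, counts, percentages)]
    return np.asarray(labels).reshape(cf_matrix.shape)

# The binary confusion matrix from above, copied by hand.
labels = make_annotations([[73, 7], [7, 141]],
                          ['True Neg', 'False Pos', 'False Neg', 'True Pos'])
print(labels[0, 0])
```

The returned array can be passed to sns.heatmap() through annot=labels with fmt='', exactly as in the snippet above. Leaving out group_names gives count and percentage only, which suits matrices with more than two classes.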
Next, you’ll learn about creating a confusion matrix for a classification model with multiple output classes.
Creating Classification Model For Multiple Classes
In this section, you’ll create a classification model for multiple output classes. This is also called multiclass classification.
You’ll be using the iris dataset available in the sklearn dataset library.
It contains 150 rows. Each row includes four numeric features and one output class. The output class is one of three Iris flower types: Iris Setosa, Iris Versicolor, or Iris Virginica.
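If you want to check these numbers and the class names yourself, a short sketch like the following should do. It only inspects the dataset and does not train anything.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# (rows, features)
print(iris.data.shape)
# Class names in the order of the class values 0, 1, 2.
print(iris.target_names)
```

The order of target_names is the order you need for the tick labels when plotting the confusion matrix later.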
To create the model, you’ll load the sklearn dataset, split it into training and test sets, and fit the KNeighborsClassifier model to the training data.
After creating the model, you can use the test data to predict the values and check how the model is performing.
You can use the actual output classes from your test data and the predicted output returned by the predict() method to plot the confusion matrix and evaluate the model accuracy.
Use the below snippet to create the model.
Snippet
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN
iris = load_iris()
X = iris.data
y = iris.target
# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
knn = KNN(n_neighbors = 3)
# train the model
knn.fit(X_train, y_train)
print('Model is Created')
Output
Model is Created
Now the model is created.
Use the test data from the train test split and predict the output value using the predict() method as shown below.
Snippet
y_pred = knn.predict(X_test)
y_pred
You’ll have the predicted output as an array. The values 0, 1, and 2 show the predicted category of each test sample.
Output
array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0,
0, 1, 1, 2, 1, 2, 1, 2, 1, 0, 2, 1, 0, 0, 0, 1])
Now, you can use the predicted data available in y_pred to create a confusion matrix for multiple classes.
Plot Confusion Matrix for Multiple Classes
In this section, you’ll learn how to plot a confusion matrix for multiple classes.
You can use the confusion_matrix() function available in the sklearn library to create a confusion matrix. It’ll contain three rows and three columns representing the actual flower category and the predicted flower category in ascending order of the class values.
Snippet
from sklearn.metrics import confusion_matrix
#Get the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
print(cf_matrix)
Output
[[23 0 0]
[ 0 19 0]
[ 0 1 17]]
The output above shows the confusion matrix for actual and predicted flower category counts.
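The row and column order matters when you attach tick labels. As a small sketch with made-up string labels (not the model's predictions), the default order is the sorted label order, and the labels parameter of confusion_matrix() lets you set a different one.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical string labels for six flowers.
y_true = ['virginica', 'setosa', 'versicolor', 'virginica', 'setosa', 'versicolor']
y_pred = ['virginica', 'setosa', 'virginica', 'virginica', 'setosa', 'versicolor']

# Default: rows and columns follow the sorted labels
# (setosa, versicolor, virginica).
default = confusion_matrix(y_true, y_pred)
print(default)

# labels= sets the order explicitly; the same list can then be reused
# for the tick labels of the plot.
order = ['virginica', 'versicolor', 'setosa']
custom = confusion_matrix(y_true, y_pred, labels=order)
print(custom)
```

Reusing the same list for labels= and for the tick labels keeps the plot and the matrix in step.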
You can use this matrix to plot the confusion matrix using the seaborn library, as shown below.
Snippet
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')
ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Flower Category')
ax.set_ylabel('Actual Flower Category ');
## Tick labels - list must follow the sorted order of the class values
ax.xaxis.set_ticklabels(['Setosa','Versicolor', 'Virginica'])
ax.yaxis.set_ticklabels(['Setosa','Versicolor', 'Virginica'])
## Display the visualization of the Confusion Matrix.
plt.show()
Output


Plot Confusion Matrix for Multiple Classes With Percentage
In this section, you’ll plot the confusion matrix for multiple classes with each cell shown as a percentage of all predictions. You can calculate the percentage by dividing the values in the confusion matrix by the sum of all values.
Use the below snippet to plot the confusion matrix for multiple classes with percentages.
Snippet
ax = sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,
fmt='.2%', cmap='Blues')
ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Flower Category')
ax.set_ylabel('Actual Flower Category ');
## Tick labels - list must follow the sorted order of the class values
ax.xaxis.set_ticklabels(['Setosa','Versicolor', 'Virginica'])
ax.yaxis.set_ticklabels(['Setosa','Versicolor', 'Virginica'])
## Display the visualization of the Confusion Matrix.
plt.show()
Output


Plot Confusion Matrix for Multiple Classes With Numbers And Percentages
In this section, you’ll learn how to plot a confusion matrix with counts and percentages for multiple classes.
You can use this to measure the percentage of each cell, for example, what percentage of the predictions belong to each category of flowers.
For this, you first need to create a list with the count for each cell and another list with the percentage for each cell.
Then you can zip these lists to create the annotation labels. Zipping means taking the items at the same position in each list and combining them into a single entry. Then, this list must be converted into an array using the np.asarray() function.
This final array must be passed to the annot parameter. This will create a confusion matrix with the count and percentage information for each cell.
Use the below snippet to visualize the confusion matrix with all the details.
Snippet
group_counts = ["{0:0.0f}".format(value) for value in
                cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     cf_matrix.flatten()/np.sum(cf_matrix)]
## Combine the count and percentage of each cell into one annotation string
labels = [f"{v1}\n{v2}" for v1, v2 in
          zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(3,3)
ax = sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
ax.set_title('Seaborn Confusion Matrix with labels\n\n')
ax.set_xlabel('\nPredicted Flower Category')
ax.set_ylabel('Actual Flower Category ')
## Tick labels - list must follow the sorted order of the class values
ax.xaxis.set_ticklabels(['Setosa','Versicolor', 'Virginica'])
ax.yaxis.set_ticklabels(['Setosa','Versicolor', 'Virginica'])
## Display the visualization of the Confusion Matrix.
plt.show()
Output


This is how you can plot a confusion matrix for multiple classes with percentages and numbers.
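The percentages above are fractions of all test samples. A common variation, sketched below as an assumption about what you might want next, divides each row by its own total, so each cell shows the share of an actual class that was predicted as each category. The matrix is copied by hand from the earlier output.

```python
import numpy as np

# The iris confusion matrix printed earlier, copied by hand so this block runs on its own.
cf_matrix = np.array([[23, 0, 0],
                      [0, 19, 0],
                      [0, 1, 17]])

# Divide each row by its own total (the number of actual samples in that class).
row_totals = cf_matrix.sum(axis=1, keepdims=True)
row_normalized = cf_matrix / row_totals

print(np.round(row_normalized, 4))
```

Each row now sums to 1, and the diagonal gives the recall of each class. The confusion_matrix() function in sklearn also accepts a normalize argument ('true', 'pred', or 'all') that performs this kind of scaling for you.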
Plot Confusion Matrix Without Classifier
To plot the confusion matrix without a classifier model, refer to this StackOverflow answer.
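Independently of that answer, one possible approach when you only have two lists of labels and no sklearn model is pandas.crosstab(). The sketch below uses made-up labels, and the names actual, predicted, and cf_table are chosen for this example.

```python
import pandas as pd

# Hypothetical actual and predicted labels collected without any sklearn model.
actual = pd.Series(['cat', 'dog', 'cat', 'cat', 'dog', 'dog'], name='Actual')
predicted = pd.Series(['cat', 'dog', 'dog', 'cat', 'dog', 'cat'], name='Predicted')

# crosstab() counts each (actual, predicted) combination.
# Rows are actual labels, columns are predicted labels.
cf_table = pd.crosstab(actual, predicted)
print(cf_table)
```

The resulting dataframe can be passed directly to sns.heatmap(cf_table, annot=True, cmap='Blues'), and seaborn uses its index and column names as the tick labels.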
Conclusion
To summarize, you’ve learned how to plot a confusion matrix for machine learning models with binary output classes and with multiple output classes.
You’ve also learned how to annotate the confusion matrix with more details such as labels, count of each label, and percentage of each label for better visualization.
If you have any questions, comment below.