Is your model good enough? Evaluation metrics in Classification and Regression

I was working on my research with sklearn and realized that choosing the right evaluation metric has always been a problem for me. If someone asks me, “Does your model perform well?”, the first thing that comes to mind is “accuracy”. Besides accuracy, there are many other metrics, and the right one depends on your problem.

Classification Metrics

Confusion Matrix

First, let us have a look at a motivating example. You train a classifier to tell dogs from cats, so each sample is assigned a class, either dog or cat. For simplicity, we set “dog” to be positive and “cat” to be negative. Four situations can occur: true dogs are recognized as dogs (True Positive, TP), true dogs are recognized as cats (False Negative, FN), true cats are recognized as cats (True Negative, TN), and true cats are recognized as dogs (False Positive, FP).

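In general, for binary classification, the confusion matrix arranges these four counts in a 2x2 table, with the actual classes along one axis and the predicted classes along the other.

With more than two classes, the confusion matrix simply grows to one row and one column per class.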
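To make the four counts concrete, here is a minimal sketch in plain Python that tallies them for the dog/cat example. The labels and the helper name `confusion_counts` are made up for illustration.

```python
# Hypothetical labels, made up for illustration: "dog" is positive, "cat" is negative.
y_true = ["dog", "dog", "dog", "cat", "cat", "cat", "cat", "dog"]
y_pred = ["dog", "cat", "dog", "cat", "dog", "cat", "cat", "dog"]

def confusion_counts(y_true, y_pred, positive="dog"):
    """Return (TP, FN, TN, FP) for a binary problem."""
    tp = fn = tn = fp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == positive:
            if guess == positive:
                tp += 1  # true dog recognized as dog
            else:
                fn += 1  # true dog recognized as cat
        else:
            if guess == positive:
                fp += 1  # true cat recognized as dog
            else:
                tn += 1  # true cat recognized as cat
    return tp, fn, tn, fp

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
print(tp, fn, tn, fp)
```

The four counts always add up to the number of samples, which is a quick sanity check.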


Accuracy

Accuracy is not restricted to binary classification. Intuitively, it is the fraction of predictions that are right. In the binary case, it is defined as the number of correct predictions over the whole population:

Accuracy = \frac{TP+TN}{TP+FP+TN+FN}
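As a rough sketch of the formula, the snippet below computes accuracy both from the four counts and directly from label lists. The function names and the imbalanced counts are my own illustration, not from any library. It also shows why accuracy alone can mislead: a model that always answers “cat” still scores well when cats dominate.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Binary accuracy from the four confusion-matrix counts."""
    return (tp + tn) / (tp + fp + tn + fn)

def accuracy_from_labels(y_true, y_pred):
    """Fraction of matching labels; works for any number of classes."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)

# Hypothetical imbalanced data: 10 dogs (positive), 90 cats (negative),
# and a lazy model that answers "cat" every time.
lazy_accuracy = accuracy_from_counts(tp=0, tn=90, fp=0, fn=10)
print(lazy_accuracy)

# Three-class labels: 4 of the 6 predictions match.
multi_accuracy = accuracy_from_labels([2, 0, 2, 2, 0, 1], [0, 0, 2, 2, 0, 2])
print(multi_accuracy)
```

The lazy model never finds a single dog, yet its accuracy is high, which is one reason to look at other metrics as well.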


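Precision and Recall

Precision is the fraction of samples predicted as positive that really are positive. Recall is the fraction of truly positive samples that the model finds.

Precision = \frac{TP}{TP+FP}
Recall = \frac{TP}{TP+FN}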
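Here is a small sketch of the two formulas with made-up counts. The helpers are hypothetical and do not guard against a zero denominator (for example, a model that never predicts the positive class).

```python
def precision(tp, fp):
    """Of everything predicted positive, how much really is positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything truly positive, how much the model found."""
    return tp / (tp + fn)

# Hypothetical counts: 8 dogs found, 2 cats mistaken for dogs, 8 dogs missed.
tp, fp, fn = 8, 2, 8
p = precision(tp, fp)
r = recall(tp, fn)
print(p, r)
```

With these counts the model is usually right when it says “dog” (precision 0.8) but misses half of the dogs (recall 0.5).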

Usually, we plot them together as a precision-recall curve. The two are related, but there is a trade-off: tuning the classifier to raise one typically lowers the other.
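One way to see the trade-off is to sweep the decision threshold on the model's scores. The sketch below uses made-up scores and a hypothetical helper `precision_recall_at`. Returning precision 1.0 when nothing is predicted positive is a convention I chose here, not a rule.

```python
# Hypothetical data: 1 = dog, 0 = cat; scores are the model's estimated
# probability that each sample is a dog.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]

def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when samples with score >= threshold are called positive."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and truth == 1:
            tp += 1
        elif predicted_positive and truth == 0:
            fp += 1
        elif not predicted_positive and truth == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention chosen here
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# One (threshold, precision, recall) point per threshold.
curve = [(t,) + precision_recall_at(t, y_true, scores) for t in (0.25, 0.5, 0.75)]
for threshold, precision, recall in curve:
    print(threshold, round(precision, 3), round(recall, 3))
```

On this toy data, raising the threshold pushes precision up and recall down. On real data the curve is usually less tidy, but the tension between the two remains.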
The F1-score (also called the F-measure) balances precision and recall by taking their harmonic mean. It is defined as:

F1-Score = \frac{2Precision \cdot Recall}{Precision+Recall}=\frac{2TP}{2TP+FP+FN}

Similarly, there is the G-measure, the geometric mean of precision and recall:

G-Score = \sqrt{Precision \cdot Recall}=\frac{TP}{\sqrt{(TP+FP)(TP+FN)}}
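As a quick check that the two forms of each formula agree, here is a sketch with made-up counts. The function names are hypothetical.

```python
import math

# Hypothetical counts, made up for illustration.
tp, fp, fn = 8, 2, 8
precision = tp / (tp + fp)
recall = tp / (tp + fn)

def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def f1_from_counts(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

def g_from_pr(precision, recall):
    """Geometric mean of precision and recall."""
    return math.sqrt(precision * recall)

def g_from_counts(tp, fp, fn):
    return tp / math.sqrt((tp + fp) * (tp + fn))

f1 = f1_from_pr(precision, recall)
g = g_from_pr(precision, recall)
print(round(f1, 4), round(f1_from_counts(tp, fp, fn), 4))
print(round(g, 4), round(g_from_counts(tp, fp, fn), 4))
```

Note that none of these functions takes TN as an argument, so adding more correctly classified negatives leaves both scores unchanged.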

But you will notice that neither of them takes TN into consideration. The following metric does.

Matthews correlation coefficient (MCC)

It is a more balanced measure than the ones above because it uses all four cells of the confusion matrix:

MCC = \frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

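MCC ranges from -1 to 1, and larger means a better model: 1 is a perfect prediction, 0 is no better than a random guess (this is how sklearn describes it), and -1 is total disagreement between predictions and truth.
Suppose we have a list of true labels A and a list of predicted labels B for which accuracy gives 0.5. MCC for the same lists can be -1/3, which says the predictions are worse than a random guess.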
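The sketch below uses one pair of lists that reproduces these numbers. This is an assumption on my side for illustration: the lists are not necessarily the A and B meant above. Returning 0 when the denominator is zero is a common convention, also assumed here.

```python
import math

# Assumed example lists (+1 = positive, -1 = negative), chosen so that
# accuracy is 0.5 and MCC is -1/3.
y_true = [1, 1, 1, -1]
y_pred = [1, -1, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention assumed here when a row or column is empty
    return (tp * tn - fp * fn) / denominator

accuracy = (tp + tn) / len(y_true)
mcc_value = mcc(tp, tn, fp, fn)
print(accuracy, mcc_value)
```

In sklearn, `sklearn.metrics.matthews_corrcoef(y_true, y_pred)` computes the same quantity directly from the label lists.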

Multi-class Classification

The confusion matrix extends to the multi-class case, as in this sklearn example:

>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])

The three classes are labeled 0, 1 and 2. Take the first row of the confusion matrix, [2,0,0], as an example. The first element means “how many true 0s are predicted as 0”, the second means “how many true 0s are predicted as 1”, and so on.
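To see where each cell comes from, here is a plain-Python sketch that rebuilds the same matrix by hand, with rows for true labels and columns for predicted labels. The helper `build_confusion_matrix` is hypothetical and only meant to mirror the layout of the sklearn output above.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] counts samples whose true label is labels[i]
    and whose predicted label is labels[j]."""
    index = {label: position for position, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
matrix = build_confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
for row in matrix:
    print(row)

# Correct predictions sit on the diagonal, so accuracy is trace / total.
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
multi_class_accuracy = correct / total
print(multi_class_accuracy)
```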

Loss Functions

In neural networks and other techniques, we talk about “loss functions”, for example log loss (also called logistic regression loss or cross-entropy loss) or hinge loss. The main idea is to measure the discrepancy between the predicted and true results. In NNs or SVMs, the loss is minimized by an optimization algorithm (say, SGD), so it needs to be chosen carefully.
See my previous post for more about energy-based learning.
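For the two losses named above, here is a minimal sketch of the binary versions. It assumes the probabilities are strictly between 0 and 1 (no clipping is done) and that hinge-loss labels are -1 or +1. The data is made up.

```python
import math

def binary_log_loss(y_true, probabilities):
    """Mean cross-entropy; y_true in {0, 1}, probabilities = P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def mean_hinge_loss(y_true, decision_values):
    """Mean hinge loss; y_true in {-1, +1}, decision_values are raw scores."""
    losses = [max(0.0, 1 - y * f) for y, f in zip(y_true, decision_values)]
    return sum(losses) / len(losses)

# Confident and right versus confident and wrong.
good = binary_log_loss([1, 0], [0.9, 0.1])
bad = binary_log_loss([1, 0], [0.1, 0.9])
print(round(good, 4), round(bad, 4))

# First sample is beyond the margin (loss 0), second is inside it (loss 0.5).
hinge = mean_hinge_loss([1, -1], [2.0, -0.5])
print(hinge)
```

Log loss punishes confident wrong answers heavily, while hinge loss is zero once a sample is on the right side of the margin.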

Regression Metrics

For any regression model, let y be the true values, \hat{y} the predicted values and \bar{y} the mean of the true values.

Explained Variation

It is defined as follows:

EV(y,\hat{y}) = 1-\frac{Var(y-\hat{y})}{Var(y)}

where Var is the variance of a group of values. Higher values mean better models, and the best possible value is 1.
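Here is a small sketch of the formula using the population variance from the standard library. The data and the helper name are made up for illustration.

```python
from statistics import pvariance

def explained_variance(y_true, y_pred):
    """1 - Var(y - y_hat) / Var(y), using the population variance."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - pvariance(residuals) / pvariance(y_true)

# Hypothetical true and predicted values.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
ev = explained_variance(y_true, y_pred)
print(round(ev, 4))

# Shifting every prediction by a constant does not change the variance
# of the residuals, so the score stays the same.
shifted = [p + 10.0 for p in y_pred]
ev_shifted = explained_variance(y_true, shifted)
print(round(ev_shifted, 4))
```

The second result shows a caveat: this score does not notice a constant bias in the predictions, so it is worth checking alongside an error metric.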

Mean Absolute Error and Mean Squared Error

They are similar and are both common ways to evaluate regression models. For n samples:

MAE = \frac{1}{n}\sum|y_i-\hat{y_i}|
MSE = \frac{1}{n}\sum(y_i-\hat{y_i})^2

Both of them take every value into account and are sensitive to outliers. If the true dataset has an extremely large value that the model misses, for example, the MAE or MSE will also be very large.
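To see the effect of an outlier, here is a sketch that computes MAE and MSE before and after adding one extreme sample. The data and function names are made up.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical true and predicted values.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(mae, mse)

# Add one extreme true value that the model badly underestimates.
y_true_outlier = y_true + [100.0]
y_pred_outlier = y_pred + [10.0]
mae_outlier = mean_absolute_error(y_true_outlier, y_pred_outlier)
mse_outlier = mean_squared_error(y_true_outlier, y_pred_outlier)
print(mae_outlier, mse_outlier)
```

Both metrics jump, and MSE jumps far more because the error is squared before averaging.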

Median Absolute Error

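It is a metric that is robust to outliers, because it takes the median of the absolute errors instead of their mean:

MedAE = median(|y_1-\hat{y_1}|, \ldots, |y_n-\hat{y_n}|)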
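A short sketch of the median absolute error, using made-up data and one extreme sample to show the robustness:

```python
from statistics import median

def median_absolute_error(y_true, y_pred):
    """Median of the absolute differences between true and predicted values."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))

# Hypothetical true and predicted values.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
medae = median_absolute_error(y_true, y_pred)

# One extreme sample with an absolute error of 90.
medae_outlier = median_absolute_error(y_true + [100.0], y_pred + [10.0])
print(medae, medae_outlier)
```

The single huge error lands at the end of the sorted list and the median barely notices it. In this toy case it does not move at all.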


R-squared

Also called the coefficient of determination.


In the figure, the pink line is our estimated regression line and the green line stands for the mean of the true values (yellow dots). We first sum up all the squared errors between the true and predicted values, \sum(y_i-\hat{y_i})^2, divide by the sum of the squared differences between the true values and their mean, \sum(y_i-\bar{y})^2, and then subtract the result from 1:

R^2 = 1-\frac{\sum(y_i-\hat{y_i})^2}{\sum(y_i-\bar{y})^2}

The R-squared value is at most 1, and it drops below 0 when the model fits worse than always predicting the mean. As before, a higher value means a better model.
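Here is a minimal sketch of the computation described above, with made-up data. The helper `r_squared` is hypothetical. It also checks two edge cases: always predicting the mean, and a model that gets the trend backwards.

```python
def r_squared(y_true, y_pred):
    """1 - (sum of squared residuals) / (sum of squared deviations from the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_residual / ss_total

# Hypothetical true and predicted values: a reasonably good fit.
r2_good = r_squared([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])

# Always predicting the mean of the true values.
r2_mean = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])

# Predictions that run in the opposite direction of the truth.
r2_bad = r_squared([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])

print(round(r2_good, 4), r2_mean, r2_bad)
```

Predicting the mean gives exactly 0, and the reversed predictions give a negative score. In sklearn, `sklearn.metrics.r2_score(y_true, y_pred)` computes this metric.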



Published by Irene

Keep calm and update blog.
