I was working on my research with sklearn when I realized that choosing the right evaluation metric has always been a problem for me. If someone asks, "Does your model perform well?", the first thing that comes to my mind is "accuracy". But besides accuracy there are many other metrics, and the right choice depends on your problem.
First, let us look at a motivating example. You train a classifier to classify dogs and cats, so each sample is assigned a class, either dog or cat. For simplicity, we let "dog" be the positive class and "cat" be the negative class. Four situations can happen: true dogs are recognized as dogs (True Positive, TP), true dogs are recognized as cats (False Negative, FN), true cats are recognized as cats (True Negative, TN), and true cats are recognized as dogs (False Positive, FP).
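The four counts can be tallied by hand. A minimal sketch, with a made-up list of labels just for illustration:

```python
# Toy dog/cat example (labels chosen for illustration only).
# "dog" is the positive class, "cat" the negative one.
y_true = ["dog", "dog", "dog", "cat", "cat", "cat"]
y_pred = ["dog", "cat", "dog", "cat", "dog", "cat"]

tp = sum(t == "dog" and p == "dog" for t, p in zip(y_true, y_pred))  # true dogs called dogs
fn = sum(t == "dog" and p == "cat" for t, p in zip(y_true, y_pred))  # true dogs called cats
tn = sum(t == "cat" and p == "cat" for t, p in zip(y_true, y_pred))  # true cats called cats
fp = sum(t == "cat" and p == "dog" for t, p in zip(y_true, y_pred))  # true cats called dogs

print(tp, fn, tn, fp)  # -> 2 1 2 1
```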
In a generalized way for binary classification:
You could also expand the confusion matrix to more than two classes.
Accuracy is not restricted to binary classification. Intuitively, it is how many of the predictions are right. In the binary case, it is defined as the number of correct predictions over the whole population: (TP + TN) / (TP + TN + FP + FN).
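In sklearn this is `accuracy_score`; a quick sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]

# 2 of 4 predictions match the true labels
print(accuracy_score(y_true, y_pred))  # -> 0.5
```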
Usually, we plot them together as the precision-recall curve. The two are related to each other, but there is a trade-off between them.
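As a reminder of the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). A small sketch with made-up labels, using sklearn:

```python
from sklearn.metrics import precision_score, recall_score

# 1 = dog (positive class), 0 = cat (negative class)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0]

# precision = TP / (TP + FP): of everything predicted "dog", how much really is a dog
print(precision_score(y_true, y_pred))  # TP=2, FP=1 -> 2/3
# recall = TP / (TP + FN): of all true dogs, how many we found
print(recall_score(y_true, y_pred))     # TP=2, FN=1 -> 2/3
```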
By using the F1-score (or F-measure), you get a balanced combination of precision and recall. It is defined as:
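The standard definition is the harmonic mean of precision and recall:

```latex
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```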
Similarly, there is G measure:
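The G-measure is the geometric mean of precision and recall:

```latex
G = \sqrt{\text{precision} \cdot \text{recall}}
```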
But you will notice that neither of them takes TN into consideration. The following methods do.
Matthews correlation coefficient (MCC)
It is a balanced measurement compared with the above. From the confusion matrix:
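Its standard definition in terms of the confusion matrix entries is:

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
```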
MCC ranges from -1 to 1, and larger means the model is better. A value of 0 corresponds to a random guess.
Suppose we have a true list A and a predicted list B such that MCC = -1/3, while accuracy gives 0.5.
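One pair of lists that reproduces these numbers (the concrete lists A and B are my own illustrative reconstruction, not the original ones) can be checked with sklearn:

```python
from sklearn.metrics import matthews_corrcoef, accuracy_score

# Illustrative lists chosen to reproduce MCC = -1/3 and accuracy = 0.5.
A = [1, 0, 0, 0]  # true labels:      TP=0, FN=1, FP=1, TN=2
B = [0, 1, 0, 0]  # predicted labels

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
#     = (0*2 - 1*1) / sqrt(1*1*3*3) = -1/3
print(matthews_corrcoef(A, B))
print(accuracy_score(A, B))  # 2 of 4 correct -> 0.5
```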
The confusion matrix can also be expanded to the multi-class case, as in this sklearn example:
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
The three classes are tagged 0, 1, and 2. Take row one of the confusion matrix, [2, 0, 0], for example. The first element means "how many true 0s are predicted as 0", the second means "how many true 0s are predicted as 1", and so on.
In neural networks or other techniques, we talk about "loss functions", for example log loss (also called logistic regression loss or cross-entropy loss) or hinge loss. The main idea is to quantify the discrepancy between the predicted and the true results. In NNs or SVMs they are paired with an optimization algorithm (say, SGD, which searches for a minimum of the loss) and need to be chosen carefully.
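A quick sketch of log loss with made-up probabilities; note it penalizes confident wrong predictions, not just wrong labels:

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 1]
# predicted probability of class 1 for each sample
y_prob = [0.1, 0.8, 0.6]

# log loss = -mean(y*log(p) + (1-y)*log(1-p)); smaller is better
print(log_loss(y_true, y_prob))  # ~ 0.28
```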
Go to my previous post to find more about energy-based learning.
Consider any regression model. Let $y_i$ be the true values, $\hat{y}_i$ the predicted values, and $\bar{y}$ the mean of the true values.

The explained variance score is defined as follows:

$$\text{explained variance} = 1 - \frac{\mathrm{Var}(y - \hat{y})}{\mathrm{Var}(y)}$$

where $\mathrm{Var}$ is the variance of a group of values. Higher values mean better models.
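Assuming this definition corresponds to sklearn's `explained_variance_score` (my reading of the text above), a small sketch with made-up values:

```python
from sklearn.metrics import explained_variance_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# 1 - Var(y - y_hat) / Var(y) = 1 - 0.3125 / 7.296875
print(explained_variance_score(y_true, y_pred))  # ~ 0.957
```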
Mean Absolute Error and Mean Squared Error
They are similar and commonly used methods for evaluating regression models.
Both of them take every value into account, and both are sensitive to outliers. If the true dataset contains an extremely large value, for example, the MAE or MSE will also be very large.
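A sketch of both, with made-up values:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# MAE = mean(|y_i - y_hat_i|)
print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0.5 + 0 + 1)/4 = 0.5
# MSE = mean((y_i - y_hat_i)^2)
print(mean_squared_error(y_true, y_pred))   # (0.25 + 0.25 + 0 + 1)/4 = 0.375
```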
Median Absolute Error
It is a method that is robust to outliers.
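To see the robustness, compare it with MAE on data containing one extreme value (values made up for illustration):

```python
from sklearn.metrics import median_absolute_error, mean_absolute_error

# one extreme outlier in the true values
y_true = [3.0, -0.5, 2.0, 7.0, 100.0]
y_pred = [2.5, 0.0, 2.0, 8.0, 7.0]

# MedAE = median(|y_i - y_hat_i|): the outlier barely moves it
print(median_absolute_error(y_true, y_pred))  # median of [0.5, 0.5, 0, 1, 93] -> 0.5
# MAE is dragged up by the single outlier
print(mean_absolute_error(y_true, y_pred))    # (0.5 + 0.5 + 0 + 1 + 93)/5 = 19.0
```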
Also called the coefficient of determination.
The pink line is our estimated regression line. The green line stands for the mean of the true values (yellow dots). We first sum up all the squared errors between the true and predicted values, $\sum_i (y_i - \hat{y}_i)^2$, divide by the sum of the squared errors between the true values and the true mean, $\sum_i (y_i - \bar{y})^2$, and subtract the result from 1:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

The R-squared value is at most 1 (and can even be negative for a model worse than always predicting the mean). Similarly, a higher value means a better model.
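In sklearn this is `r2_score`; a sketch with made-up values:

```python
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# R^2 = 1 - sum((y - y_hat)^2) / sum((y - y_bar)^2)
#     = 1 - 1.5 / 29.1875
print(r2_score(y_true, y_pred))  # ~ 0.949
```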