Balanced accuracy is a metric that one can use when evaluating how good a binary classifier is. It is especially useful when the classes are imbalanced, i.e. when one of the two classes appears much more often than the other. This happens frequently in settings such as anomaly detection and disease diagnosis.
As with all discussions on the performance of a binary classifier, we start with a confusion matrix:
In the above, the “positive” or “negative” in TP/FP/TN/FN refers to the prediction made, not the actual class. (Hence, a “false positive” is a case where we wrongly predicted positive.)
Balanced accuracy is based on two more commonly used metrics: sensitivity (also known as true positive rate or recall) and specificity (also known as true negative rate, or 1 – false positive rate). Sensitivity answers the question: “How many of the positive cases did I detect?” Or to put it in a manufacturing setting: “How many (truly) defective products did I manage to recall?” Specificity answers that same question but for the negative cases. Here are the formulas for sensitivity and specificity in terms of the confusion matrix:

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
Balanced accuracy is simply the arithmetic mean of the two:

balanced accuracy = (sensitivity + specificity) / 2
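As a quick sketch of the definitions above, one might wrap them in a small R helper (the function name and the example counts below are my own, purely illustrative choices):

```r
# Compute balanced accuracy from confusion-matrix counts.
# Function and argument names are illustrative.
balanced_accuracy <- function(TP, FP, TN, FN) {
  sensitivity <- TP / (TP + FN)  # true positive rate (recall)
  specificity <- TN / (TN + FP)  # true negative rate
  (sensitivity + specificity) / 2
}

# Example: sensitivity = 40/100 = 0.4, specificity = 90/100 = 0.9
balanced_accuracy(TP = 40, FP = 10, TN = 90, FN = 60)  # 0.65
```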
Let’s use an example to illustrate how balanced accuracy can be a better judge of performance in the imbalanced class setting. Assume that we have a binary classifier and it gave us the results in the confusion matrix below:
The accuracy of this classifier, i.e. the proportion of correct predictions, is (TP + TN) / (TP + TN + FP + FN). That sounds really impressive until you realize that simply by predicting all negative, we would have obtained an accuracy equal to the proportion of negative cases, which in this imbalanced setting is better than our classifier!
Balanced accuracy attempts to account for the imbalance in classes. Here is the computation for balanced accuracy for our classifier:
Our classifier is doing a great job at picking out the negatives but not so for the positives. Balanced accuracy still seems a little high if identifying the positives is what we care about, but it’s much lower than what accuracy suggested.
For comparison, let’s do the computation for the classifier that always predicts 0 (negative). Since this classifier makes no positive predictions, TP = 0 and FP = 0, so sensitivity = 0, specificity = 1, and balanced accuracy = (0 + 1) / 2 = 0.5.
Based on balanced accuracy, we would say that our classifier is doing a little better than the naive “all negatives” classifier, but not much better. This seems like a reasonable conclusion since our classifier is able to pick out some positives but not many of them.
Here is some R code that you can use to compute these measures:
# confusion-matrix counts (here, those of the "all negatives" classifier)
TP <- 0
TN <- 10050
FP <- 0
FN <- 15

# metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
balanced_accuracy <- (sensitivity + specificity) / 2

# print out metrics
options(digits = 4)
cat("Accuracy:", accuracy, "\n",
    "Sensitivity:", sensitivity, "\n",
    "Specificity:", specificity, "\n",
    "Balanced accuracy:", balanced_accuracy)
Note: This reference points out that balanced accuracy can be extended easily to the multi-class setting: there it is simply the arithmetic mean of the recall for all the classes.
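As a sketch of that multi-class extension, one could compute the recall for each class and average them. The helper name and the toy labels below are illustrative, not from the reference:

```r
# Multi-class balanced accuracy: the arithmetic mean of per-class recall.
# Function and variable names are illustrative.
multiclass_balanced_accuracy <- function(actual, predicted) {
  classes <- unique(actual)
  recalls <- sapply(classes, function(cls) {
    sum(predicted == cls & actual == cls) / sum(actual == cls)
  })
  mean(recalls)
}

actual    <- c("a", "a", "a", "b", "b", "c")
predicted <- c("a", "a", "b", "b", "b", "a")
# per-class recall: a = 2/3, b = 2/2, c = 0/1; mean = 5/9
multiclass_balanced_accuracy(actual, predicted)
```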
Note: Another popular metric one can use for imbalanced datasets is the F1 score, which is the harmonic mean of precision and recall.
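For completeness, here is a small sketch of the F1 computation from the same confusion-matrix counts (the function name and example counts are illustrative):

```r
# F1 score: harmonic mean of precision and recall.
# Function and argument names are illustrative.
f1_score <- function(TP, FP, FN) {
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)  # same as sensitivity above
  2 * precision * recall / (precision + recall)
}

# Example: precision = 40/50 = 0.8, recall = 40/100 = 0.4
f1_score(TP = 40, FP = 10, FN = 60)
```

Note that, unlike balanced accuracy, the F1 score does not use TN at all, so it focuses entirely on how well the positive class is handled.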