Understanding the AUROC metric

Originally posted 2019-10-29

Tagged: machine learning, statistics

Obligatory disclaimer: all opinions are mine and not of my employer


In this essay I’ll dig into the difficulties of comparing two binary classifiers.

Classification is the task of predicting which class an instance belongs to. Many different kinds of classifiers exist - random forests and neural networks are two of the most popular in the machine learning world, but there are many others out there. How do we know which one is the best?

As it turns out, comparing classifiers is surprisingly hard. You would think that just comparing accuracy would be the easy answer, but unfortunately accuracy has a number of disadvantages. For example, if you were trying to detect a rare disease that occurs in 1% of people, and your classifier perfectly detected the rare disease (but accidentally marked 2% of healthy people as diseased), then the accuracy of that classifier would be about 98%. A classifier that only ever says “no disease” would have an accuracy of 99%. You would probably agree that the first classifier is much more useful than the second one, and yet the second classifier has the higher accuracy.
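
To make the accuracy arithmetic concrete, here’s a quick back-of-the-envelope check (the population size of 10,000 is just an assumption to keep the numbers round):

population = 10000
diseased = population // 100           # 1% of people actually have the disease
healthy = population - diseased        # 9,900 healthy people

# Classifier 1: catches every diseased person, but flags 2% of healthy people.
correct_1 = diseased + int(0.98 * healthy)
print(correct_1 / population)          # 0.9802 - about 98% accuracy

# Classifier 2: always says "no disease".
correct_2 = healthy
print(correct_2 / population)          # 0.99 - 99% accuracy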

Since accuracy doesn’t really work that well, people look at a variety of other metrics, like recall (what percent of diseased people did we catch?) and precision (what percent of disease classifications were correct?) and so on. Each has strengths and weaknesses. You can find the full matrix of possible metrics on Wikipedia, and although I’ve studied this page thoroughly, the truth is that my eyes glaze over pretty quickly. It doesn’t help that multiple fields have reinvented these concepts independently and each field has invented its own names for the same thing - sensitivity, specificity, recall, precision, PPV, NPV, etc. “Type I error” and “Type II error” definitely take the cake for worst naming ever. I’ve attended a number of talks given by statisticians to machine learning experts (and vice versa), and the amount of confusion generated by bad naming of such basic ideas is simultaneously hilarious and saddening.
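
For what it’s worth, most of these metrics are one-liners once you have the confusion-matrix counts. Here’s a minimal sketch (the function name and the tp/fp/fn/tn argument names are mine, not from any library):

def metrics_from_confusion(tp, fp, fn, tn):
  recall = tp / (tp + fn)         # a.k.a. sensitivity, true positive rate
  precision = tp / (tp + fp)      # a.k.a. positive predictive value (PPV)
  specificity = tn / (tn + fp)    # a.k.a. true negative rate
  accuracy = (tp + tn) / (tp + fp + fn + tn)
  return recall, precision, specificity, accuracy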

In practice, the metric that many studies use is AUROC (the area under the ROC curve, often just called AUC). Unfortunately, the ROC is defined by referring to true positive rates and false positive rates and false negative rates and all those things that make your eyes glaze over. (Pop quiz: can you recall what the X and Y axes of the ROC curve are, off the top of your head?)

So instead, let’s try and figure out a better metric by starting from scratch. (Spoiler: we’re going to rederive AUROC).

Counting Mistakes

One way to think of a classifier is as a sorting algorithm. Given a set of patients who may or may not have a disease, the classifier gives a score to each patient, and then we can order the patients from “least likely to have disease” to “most likely to have disease”. The raw output of the classifier is unimportant, other than providing a way to create an ordinal ranking of patients.

We’re throwing away some information here by converting scores into ranks. But since a classifier’s prediction score usually needs to be rescaled and calibrated anyway, throwing away the exact scores is not a big deal. And ranking makes sense from the doctor’s perspective: they can go down the list from most likely to least likely, until they decide that the marginal gain isn’t worth the time.
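
As a tiny illustration (the patient names and scores below are made up), turning raw classifier scores into a ranking is just a sort:

# Hypothetical scores; higher means "more likely to have the disease".
patients = ["Ann", "Bob", "Cat", "Dan"]
scores = [0.12, 0.85, 0.40, 0.91]

# Order patients from least likely to most likely to have the disease.
ranking = [patient for _, patient in sorted(zip(scores, patients))]
print(ranking)   # ['Ann', 'Cat', 'Bob', 'Dan']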

A perfect ranking would look something like H H H H H H H D D D D (H = healthy, D = diseased), a less-than-perfect ranking would look like H H D H H H H D D H D, and the worst possible ranking would be D D D D H H H H H H H - a completely backwards ranking. So we’d like a metric that gives the best ranking a score of 1, and the worst ranking a score of 0, and everything else somewhere in between. We’d expect a random ordering to have a score of 0.5, since it actually takes some skill to get the ranking exactly backwards.

Here’s a simple idea: let’s count the number of mistakes the model makes. Every time the model ranks a healthy person as more likely to have the disease than someone who actually has it, that’s a mistake. The classifier’s overall score will then be one minus the fraction of potential mistakes that it actually made.

Ranking      | Mistakes | Score
--------------------------------
(H, H, D, D) |  0 / 4   |  1
(H, D, H, D) |  1 / 4   |  0.75
(H, D, D, H) |  2 / 4   |  0.5
(D, H, H, D) |  2 / 4   |  0.5
(D, H, D, H) |  3 / 4   |  0.25
(D, D, H, H) |  4 / 4   |  0

Implementation

import itertools

def fraction_correct(rearrangement):
  """Fraction of (diseased, healthy) pairs that are ordered correctly."""
  positive_indices = []
  negative_indices = []
  for i, item in enumerate(rearrangement):
    if item == 0:
      negative_indices.append(i)
    else:
      positive_indices.append(i)
  # Uses the punning of True / 1, False / 0 in Python
  correct_pairings = sum(p > n for p, n in itertools.product(
      positive_indices, negative_indices))
  total_pairings = len(positive_indices) * len(negative_indices)
  return correct_pairings / total_pairings

As promised, it turns out that this function yields results identical to sklearn’s AUC implementation.

>>> import itertools
>>> import sklearn.metrics
>>> data = [0, 0, 0, 1, 1]
>>> scores = list(range(len(data)))
>>> for rearrangement in sorted(set(itertools.permutations(data))):
...     print("%s %0.3f %0.3f" % (
...         rearrangement,
...         sklearn.metrics.roc_auc_score(rearrangement, scores),
...         fraction_correct(rearrangement)))

(0, 0, 0, 1, 1) 1.000 1.000
(0, 0, 1, 0, 1) 0.833 0.833
(0, 0, 1, 1, 0) 0.667 0.667
(0, 1, 0, 0, 1) 0.667 0.667
(0, 1, 0, 1, 0) 0.500 0.500
(0, 1, 1, 0, 0) 0.333 0.333
(1, 0, 0, 0, 1) 0.500 0.500
(1, 0, 0, 1, 0) 0.333 0.333
(1, 0, 1, 0, 0) 0.167 0.167
(1, 1, 0, 0, 0) 0.000 0.000

Why does this work?

I think it’s pretty surprising that these two scores turn out to measure the same thing. One is a statement about discrete probabilities - “what’s the chance that a randomly chosen (diseased, healthy) pair is ordered correctly?”. The other is a statement about continuous curves - “the area under the curve plotting true positive rate vs. false positive rate”.

To explain this equivalence, let’s go back to our pairwise comparison algorithm. The implementation as written is \(O(N^2)\), because it constructs lists of positive and negative examples and then checks every pairwise comparison. That’s a bit inefficient, especially if you want to evaluate your classifier over a large dataset. There’s actually an \(O(N)\) algorithm: keep a running count of positive examples, and every time you see a negative example, add that count to a tally of mistakes - every positive seen before this negative was ranked as less likely to have the disease than a healthy person. Codifying this idea, we arrive at the following:

def fraction_correct_optimized(rearrangement):
  negative_seen = 0
  positive_seen = 0
  mistakes_seen = 0
  for item in rearrangement:
    if item == 0:
      # Every positive seen so far was ranked as less likely to be diseased
      # than this healthy person - each such pair is one mistake.
      negative_seen += 1
      mistakes_seen += positive_seen
    else:
      positive_seen += 1
  fraction_mistakes = mistakes_seen / (negative_seen * positive_seen)
  return 1 - fraction_mistakes

We can verify that this also returns identical results to our \(O(N^2)\) implementation and the reference AUC implementation.
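
Here’s roughly what that check looks like (a sketch that assumes fraction_correct and fraction_correct_optimized are both defined as above):

import itertools
import sklearn.metrics

data = [0, 0, 0, 1, 1]
scores = list(range(len(data)))
for rearrangement in sorted(set(itertools.permutations(data))):
  reference = sklearn.metrics.roc_auc_score(rearrangement, scores)
  assert abs(fraction_correct(rearrangement) - reference) < 1e-9
  assert abs(fraction_correct_optimized(rearrangement) - reference) < 1e-9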

So the funny thing about this implementation is that it looks like the rectangular approximation to the integral of a curve. Let me rename the variables to make it a bit more obvious:

X, Y = 0, 1
def area_under_curve(steps):
  current_x = 0
  current_y = 0
  area = 0
  for step_direction in steps:
    if step_direction == X:
      # Each step to the right adds a rectangle of width 1 and height current_y.
      current_x += 1
      area += current_y
    if step_direction == Y:
      current_y += 1
  normalized_area = area / (current_x * current_y)
  return 1 - normalized_area

So the x-axis counts the actual negatives seen so far, and the y-axis counts the actual positives seen so far. To get rid of the “1 minus area” in the last line of code, flip the y-axis so that it counts the positives not yet seen - exactly the true positives you would catch by classifying everyone past the current point as diseased. Flipping the x-axis as well just mirrors the curve, which doesn’t change its area, but it turns “negatives seen” into false positives. Normalize both counts into rates and you have the standard ROC curve, which plots true positive rate as a function of false positive rate, aka (1 - true negative rate). After having gone through this derivation, it’s much easier to remember what the ROC’s axes are.
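
As a final sanity check, we can ask sklearn for the ROC curve directly and integrate it numerically; the area comes out the same (a small sketch using numpy’s trapezoidal rule on one of the rankings from the table above):

import numpy as np
import sklearn.metrics

rearrangement = (0, 0, 1, 0, 1)
scores = list(range(len(rearrangement)))

fpr, tpr, _ = sklearn.metrics.roc_curve(rearrangement, scores)
print(np.trapz(tpr, fpr))                                    # 0.833...
print(sklearn.metrics.roc_auc_score(rearrangement, scores))  # 0.833...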