Here we describe ways in which machine learning algorithms are evaluated.
We need to quantify what we mean when we say an algorithm performs better.
We demonstrate with a boring and simple example: how to predict sex using height.
We introduce the caret package, which provides useful functions to facilitate machine learning in R.
We describe caret in more detail later.
We use the `createDataPartition` function from caret to define the training and test sets; a sketch of this step follows.
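This sketch assumes the heights data from the dslabs package, with sex as the outcome and height as the predictor; the seed and the 50/50 split are illustrative assumptions, not prescribed values.

```r
library(caret)
library(dslabs)  # for the heights data

y <- heights$sex
x <- heights$height

# Randomly split the data: half for training, half for testing
set.seed(2007)  # assumed seed, for reproducibility only
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
test_set <- heights[test_index, ]
train_set <- heights[-test_index, ]
```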
Can we do better than simply guessing the outcome? Exploratory data analysis suggests we can because, on average, males are slightly taller than females:
```
# A tibble: 2 × 3
  sex      avg    sd
  <fct>  <dbl> <dbl>
1 Female  64.9  3.76
2 Male    69.3  3.61
```
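A sketch of how a summary like this might be computed with dplyr; whether to summarize the full heights data or only the training set is a choice we leave as an assumption, and the exact numbers depend on it.

```r
library(dplyr)
library(dslabs)  # for the heights data

# Average and standard deviation of height by sex
heights |>
  group_by(sex) |>
  summarize(avg = mean(height), sd = sd(height))
```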
A simple approach is to predict Male if the height is within two standard deviations of the average male; a sketch of this rule is given below.
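A sketch of this rule, using the rounded summary above; the cutoff of 62 (approximately 69.3 − 2 × 3.61) and the use of the test set are assumptions for illustration.

```r
# Predict Male whenever height is above roughly two SDs below the male average
y_hat <- ifelse(test_set$height > 62, "Male", "Female") |>
  factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex)
```

Since this rule exploits the height difference summarized above, its accuracy should already be well above the 0.5 we would expect from guessing.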
We can do even better by examining the accuracy obtained for a range of other cutoffs on the training set and picking the value that maximizes it; the resulting training accuracy is much higher than 0.5. The cutoff resulting in this maximum, `best_cutoff`, can then be tested on the test set; a sketch of the search follows.
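A minimal sketch of this search; the grid of candidate cutoffs from 61 to 70 inches is an assumption.

```r
# Candidate cutoffs to evaluate on the training set (grid is an assumption)
cutoff <- seq(61, 70)

# Training-set accuracy for each cutoff
accuracy <- sapply(cutoff, function(x) {
  y_hat <- ifelse(train_set$height > x, "Male", "Female") |>
    factor(levels = levels(train_set$sex))
  mean(y_hat == train_set$sex)
})

# Cutoff that maximizes training-set accuracy
best_cutoff <- cutoff[which.max(accuracy)]
```

Applying `best_cutoff` to the test set: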
```r
y_hat <- ifelse(test_set$height > best_cutoff, "Male", "Female") |>
  factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex)
#> [1] 0.806
```
This estimate of accuracy is somewhat biased because we chose the cutoff that maximized accuracy on the training set (slight overtraining). But ultimately we tested on a dataset that we did not train on.
The rule we developed predicts Male if the student is taller than 64 inches. To understand the errors it makes, we can tabulate each combination of prediction and actual value, known as the confusion matrix:

```
          Reference
Prediction Female Male
    Female     24   15
    Male       36  188
```

Studying this table reveals an imbalance: more than half of the actual females are predicted to be male. This happens because the prevalence of males is high.
These heights were collected from three data science courses, two of which had higher male enrollment.
So when computing overall accuracy, the high percentage of mistakes made for females is outweighed by the gains in correct calls for men.
This type of bias can actually be a big problem in practice.
If your training data is biased in some way, you are likely to develop algorithms that are biased as well.
The fact that we used a test set does not matter because it is also derived from the original biased dataset.
This is one of the reasons we look at metrics other than overall accuracy when evaluating a machine learning algorithm.
A general improvement to using overall accuracy is to study sensitivity and specificity separately.
To define these metrics, we need a binary outcome; when the outcome is categorical, such as Female or Male here, we can define them for a specific category.
Sensitivity is defined as the ability of an algorithm to predict a positive outcome when the actual outcome is positive: \(\hat{y}=1\) when \(y=1\).
Because an algorithm that calls everything positive has perfect sensitivity, this metric on its own is not enough to judge an algorithm.
Specificity is the ability of an algorithm to not predict a positive (\(\hat{y}=0\)) when the actual outcome is not a positive (\(y=0\)).
We can summarize in the following way:
High sensitivity: \(y=1 \implies \hat{y}=1\).
High specificity: \(y=0 \implies \hat{y} = 0\).
Although the above is often considered the definition of specificity, another way to think of specificity is by the proportion of positive calls that are actually positive:
High specificity: \(\hat{y}=1 \implies y=1\).
| | Actually Positive | Actually Negative |
|---|---|---|
| Predicted positive | True positives (TP) | False positives (FP) |
| Predicted negative | False negatives (FN) | True negatives (TN) |
Sensitivity is typically quantified by \(TP/(TP+FN)\).
This quantity is referred to as the true positive rate (TPR) or recall.
Specificity is defined as \(TN/(TN+FP)\).
This quantity is also called the true negative rate (TNR).
There is another way of quantifying specificity, which is \(TP/(TP+FP)\).
This quantity is referred to as positive predictive value (PPV) and also as precision.
Note that, unlike TPR and TNR, precision depends on prevalence since higher prevalence implies you can get higher precision even when guessing.
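For example, treating Female as the positive class in the confusion matrix shown earlier:

\[
\mbox{sensitivity} = \frac{24}{24+36} = 0.40, \qquad
\mbox{specificity} = \frac{188}{188+15} \approx 0.93, \qquad
\mbox{precision} = \frac{24}{24+15} \approx 0.62
\]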
The multiple names can be confusing, so we include a table to help us remember the terms.
| Measure of | Name 1 | Name 2 | Definition | Probability representation |
|---|---|---|---|---|
| sensitivity | TPR | Recall | \(\frac{\mbox{TP}}{\mbox{TP} + \mbox{FN}}\) | \(\mbox{Pr}(\hat{Y}=1 \mid Y=1)\) |
| specificity | TNR | 1-FPR | \(\frac{\mbox{TN}}{\mbox{TN}+\mbox{FP}}\) | \(\mbox{Pr}(\hat{Y}=0 \mid Y=0)\) |
| specificity | PPV | Precision | \(\frac{\mbox{TP}}{\mbox{TP}+\mbox{FP}}\) | \(\mbox{Pr}(Y=1 \mid \hat{Y}=1)\) |
The `confusionMatrix` function in the caret package computes all of these metrics once we specify which category counts as "positive"; a sketch of its use is given below. Because prevalence is low (females are the minority in these data), failing to predict actual females as females (low sensitivity) does not lower the overall accuracy as much as failing to predict actual males as males (low specificity).
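A sketch of how this could look, assuming the `y_hat` and `test_set` objects defined above; note that `confusionMatrix` uses the first factor level (here Female) as the positive class unless told otherwise.

```r
library(caret)

# Full set of metrics for the current predictions
cm <- confusionMatrix(data = y_hat, reference = test_set$sex)

# Overall accuracy plus selected class-specific metrics
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity", "Prevalence")]
```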
This is an example of why it is important to examine sensitivity and specificity and not just accuracy.
Before applying this algorithm to general datasets, we need to ask ourselves if prevalence will be the same.
Often a one-number summary is preferred, for example for optimization purposes. A widely used one is the \(F_1\)-score, the harmonic average of precision and recall:

\[ \frac{1}{\frac{1}{2}\left(\frac{1}{\mbox{recall}} + \frac{1}{\mbox{precision}}\right) } \]

Because it is easier to write, you often see this harmonic average rewritten as

\[ 2 \times \frac{\mbox{precision} \cdot \mbox{recall}} {\mbox{precision} + \mbox{recall}} \]

when defining \(F_1\). Depending on the context, some types of errors are costlier than others, so the \(F_1\)-score can be adapted to weigh sensitivity and specificity differently: we choose \(\beta\) to represent how much more important sensitivity is compared to specificity and use the weighted harmonic average

\[ \frac{1}{\frac{\beta^2}{1+\beta^2}\frac{1}{\mbox{recall}} + \frac{1}{1+\beta^2}\frac{1}{\mbox{precision}} } \]
The `F_meas` function in the caret package computes this summary with `beta` defaulting to 1.
Let’s rebuild our prediction algorithm, but this time maximizing the F-score instead of overall accuracy:
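A sketch of this search, analogous to the accuracy-maximizing search above; the grid of candidate cutoffs is again an assumption.

```r
# Candidate cutoffs to evaluate on the training set (grid is an assumption)
cutoff <- seq(61, 70)

# F_1 on the training set for each cutoff (Female is the relevant
# class by default, since it is the first factor level)
F_1 <- sapply(cutoff, function(x) {
  y_hat <- ifelse(train_set$height > x, "Male", "Female") |>
    factor(levels = levels(train_set$sex))
  F_meas(data = y_hat, reference = train_set$sex)
})

# Cutoff that maximizes the F-score
best_cutoff <- cutoff[which.max(F_1)]
```

With the `best_cutoff` chosen this way, sensitivity and specificity on the test set are more balanced: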
```r
y_hat <- ifelse(test_set$height > best_cutoff, "Male", "Female") |>
  factor(levels = levels(test_set$sex))
sensitivity(data = y_hat, reference = test_set$sex)
#> [1] 0.633
specificity(data = y_hat, reference = test_set$sex)
#> [1] 0.857
```
Up to now we have described evaluation metrics that apply exclusively to categorical data.
Specifically, for binary outcomes, we have described how sensitivity, specificity, accuracy, and \(F_1\) can be used as quantification.
However, these metrics are not useful for continuous outcomes.
In this section, we describe how the general approach to defining “best” in machine learning is to define a loss function, which can be applied to both categorical and continuous data.
For continuous outcomes, the most commonly used loss function is the squared loss. The mean squared error (MSE) is defined as

\[ \text{MSE} \equiv \mbox{E}\{(\hat{Y} - Y)^2 \} \]

and, in practice, we estimate it with the average squared error on a test set with \(N\) observations:

\[ \hat{\mbox{MSE}} = \frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2 \]
Note: In practice, we often report the root mean squared error (RMSE), which is simply \(\sqrt{\mbox{MSE}}\), because it is in the same units as the outcomes.
The estimate \(\hat{\text{MSE}}\) is a random variable.
\(\text{MSE}\) and \(\hat{\text{MSE}}\) are often referred to as the true error and apparent error, respectively.
It is difficult to derive the statistical properties of how well the apparent error estimates the true error.
We later introduce cross-validation, an approach to estimating the MSE.
There are loss functions other than the squared loss.
For example, the mean absolute error uses the absolute values \(|\hat{y}_i - y_i|\) instead of the squared errors \((\hat{y}_i - y_i)^2\).
However, in this book we focus on minimizing squared loss since it is the most widely used.
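As a small illustration, here are minimal helper functions for these losses; the names `rmse` and `mae` are our own, not from any package.

```r
# Root mean squared error: same units as the outcome
rmse <- function(y_hat, y) sqrt(mean((y_hat - y)^2))

# Mean absolute error: uses |y_hat - y| instead of squared errors
mae <- function(y_hat, y) mean(abs(y_hat - y))

# Hypothetical usage with made-up vectors
y <- c(3.2, 4.1, 5.0)
y_hat <- c(3.0, 4.5, 4.8)
rmse(y_hat, y)
mae(y_hat, y)
```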