Here we describe ways in which machine learning algorithms are evaluated.
We need to quantify what we mean when we say an algorithm performs better.
We demonstrate with a boring and simple example: how to predict sex using height.
We introduce the caret package, which provides useful functions to facilitate machine learning in R.
We describe caret in more detail later.
We use the `createDataPartition` function from caret to define the training and test sets; a sketch of this step follows.
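This sketch assumes the heights data from the dslabs package, with sex as the outcome and height as the predictor; the seed and the 50/50 split are illustrative assumptions, not prescribed values.

```r
library(caret)
library(dslabs)  # for the heights data

y <- heights$sex
x <- heights$height

# Randomly split the data: half for training, half for testing
set.seed(2007)  # assumed seed, for reproducibility only
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
test_set <- heights[test_index, ]
train_set <- heights[-test_index, ]
```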
Can we do better than simply guessing the outcome? Exploratory data analysis suggests we can because, on average, males are slightly taller than females:
```
# A tibble: 2 × 3
  sex      avg    sd
  <fct>  <dbl> <dbl>
1 Female  64.9  3.76
2 Male    69.3  3.61
```
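A sketch of how a summary like this might be computed with dplyr; whether to summarize the full heights data or only the training set is a choice we leave as an assumption, and the exact numbers depend on it.

```r
library(dplyr)
library(dslabs)  # for the heights data

# Average and standard deviation of height by sex
heights |>
  group_by(sex) |>
  summarize(avg = mean(height), sd = sd(height))
```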
A simple approach is to predict Male if the height is within two standard deviations of the average male; a sketch of this rule is given below.
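A sketch of this rule, using the rounded summary above; the cutoff of 62 (approximately 69.3 − 2 × 3.61) and the use of the test set are assumptions for illustration.

```r
# Predict Male whenever height is above roughly two SDs below the male average
y_hat <- ifelse(test_set$height > 62, "Male", "Female") |>
  factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex)
```

Since this rule exploits the height difference summarized above, its accuracy should already be well above the 0.5 we would expect from guessing.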
We can do even better by examining the accuracy obtained for a range of other cutoffs on the training set and picking the value that maximizes it; the resulting training accuracy is much higher than 0.5. The cutoff resulting in this maximum, `best_cutoff`, can then be tested on the test set; a sketch of the search follows.
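A minimal sketch of this search; the grid of candidate cutoffs from 61 to 70 inches is an assumption.

```r
# Candidate cutoffs to evaluate on the training set (grid is an assumption)
cutoff <- seq(61, 70)

# Training-set accuracy for each cutoff
accuracy <- sapply(cutoff, function(x) {
  y_hat <- ifelse(train_set$height > x, "Male", "Female") |>
    factor(levels = levels(train_set$sex))
  mean(y_hat == train_set$sex)
})

# Cutoff that maximizes training-set accuracy
best_cutoff <- cutoff[which.max(accuracy)]
```

Applying `best_cutoff` to the test set: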
```r
y_hat <- ifelse(test_set$height > best_cutoff, "Male", "Female") |>
  factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex)
#> [1] 0.806
```
This estimate of accuracy is somewhat biased because we chose the cutoff that maximized accuracy on the training set (slight overtraining). But ultimately we tested on a dataset that we did not train on.
The rule we developed predicts Male if the student is taller than 64 inches. To understand the errors it makes, we can tabulate each combination of prediction and actual value, known as the confusion matrix:

```
          Reference
Prediction Female Male
    Female     24   15
    Male       36  188
```

Studying this table reveals an imbalance: more than half of the actual females are predicted to be male. This happens because the prevalence of males is high.
These heights were collected from three data science courses, two of which had higher male enrollment.
So when computing overall accuracy, the high percentage of mistakes made for females is outweighed by the gains in correct calls for men.
This type of bias can actually be a big problem in practice.
If your training data is biased in some way, you are likely to develop algorithms that are biased as well.
The fact that we used a test set does not matter because it is also derived from the original biased dataset.
This is one of the reasons we look at metrics other than overall accuracy when evaluating a machine learning algorithm.
A general improvement to using overall accuracy is to study sensitivity and specificity separately.
To define these metrics, we need a binary outcome; when the outcome is categorical, such as Female or Male here, we can define them for a specific category.
Sensitivity is defined as the ability of an algorithm to predict a positive outcome when the actual outcome is positive: \(\hat{y}=1\) when \(y=1\).
Because an algorithm that calls everything positive has perfect sensitivity, this metric on its own is not enough to judge an algorithm.
Specificity is the ability of an algorithm to not predict a positive (\(\hat{y}=0\)) when the actual outcome is not a positive (\(y=0\)).
We can summarize in the following way:
High sensitivity: \(y=1 \implies \hat{y}=1\).
High specificity: \(y=0 \implies \hat{y} = 0\).
Although the above is often considered the definition of specificity, another way to think of specificity is by the proportion of positive calls that are actually positive:
High specificity: \(\hat{y}=1 \implies y=1\).
| | Actually Positive | Actually Negative |
|---|---|---|
| Predicted positive | True positives (TP) | False positives (FP) |
| Predicted negative | False negatives (FN) | True negatives (TN) |
Sensitivity is typically quantified by \(TP/(TP+FN)\).
This quantity is referred to as the true positive rate (TPR) or recall.
Specificity is defined as \(TN/(TN+FP)\).
This quantity is also called the true negative rate (TNR).
There is another way of quantifying specificity, which is \(TP/(TP+FP)\).
This quantity is referred to as positive predictive value (PPV) and also as precision.
Note that, unlike TPR and TNR, precision depends on prevalence since higher prevalence implies you can get higher precision even when guessing.
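For example, treating Female as the positive class in the confusion matrix shown earlier:

\[
\mbox{sensitivity} = \frac{24}{24+36} = 0.40, \qquad
\mbox{specificity} = \frac{188}{188+15} \approx 0.93, \qquad
\mbox{precision} = \frac{24}{24+15} \approx 0.62
\]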
The multiple names can be confusing, so we include a table to help us remember the terms.
| Measure of | Name 1 | Name 2 | Definition | Probability representation |
|---|---|---|---|---|
| sensitivity | TPR | Recall | \(\frac{\mbox{TP}}{\mbox{TP} + \mbox{FN}}\) | \(\mbox{Pr}(\hat{Y}=1 \mid Y=1)\) |
| specificity | TNR | 1-FPR | \(\frac{\mbox{TN}}{\mbox{TN}+\mbox{FP}}\) | \(\mbox{Pr}(\hat{Y}=0 \mid Y=0)\) |
| specificity | PPV | Precision | \(\frac{\mbox{TP}}{\mbox{TP}+\mbox{FP}}\) | \(\mbox{Pr}(Y=1 \mid \hat{Y}=1)\) |
The `confusionMatrix` function in the caret package computes all of these metrics once we specify which category counts as "positive"; a sketch of its use is given below. Because prevalence is low (females are the minority in these data), failing to predict actual females as females (low sensitivity) does not lower the overall accuracy as much as failing to predict actual males as males (low specificity).
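A sketch of how this could look, assuming the `y_hat` and `test_set` objects defined above; note that `confusionMatrix` uses the first factor level (here Female) as the positive class unless told otherwise.

```r
library(caret)

# Full set of metrics for the current predictions
cm <- confusionMatrix(data = y_hat, reference = test_set$sex)

# Overall accuracy plus selected class-specific metrics
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity", "Prevalence")]
```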
This is an example of why it is important to examine sensitivity and specificity and not just accuracy.
Before applying this algorithm to general datasets, we need to ask ourselves if prevalence will be the same.
Often a one-number summary is preferred, for example for optimization purposes. A widely used one is the \(F_1\)-score, the harmonic average of precision and recall:

\[ \frac{1}{\frac{1}{2}\left(\frac{1}{\mbox{recall}} + \frac{1}{\mbox{precision}}\right) } \]

Because it is easier to write, you often see this harmonic average rewritten as

\[ 2 \times \frac{\mbox{precision} \cdot \mbox{recall}} {\mbox{precision} + \mbox{recall}} \]

when defining \(F_1\). Depending on the context, some types of errors are costlier than others, so the \(F_1\)-score can be adapted to weigh sensitivity and specificity differently: we choose \(\beta\) to represent how much more important sensitivity is compared to specificity and use the weighted harmonic average

\[ \frac{1}{\frac{\beta^2}{1+\beta^2}\frac{1}{\mbox{recall}} + \frac{1}{1+\beta^2}\frac{1}{\mbox{precision}} } \]
The `F_meas` function in the caret package computes this summary with `beta` defaulting to 1.
Let’s rebuild our prediction algorithm, but this time maximizing the F-score instead of overall accuracy:
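A sketch of this search, analogous to the accuracy-maximizing search above; the grid of candidate cutoffs is again an assumption.

```r
# Candidate cutoffs to evaluate on the training set (grid is an assumption)
cutoff <- seq(61, 70)

# F_1 on the training set for each cutoff (Female is the relevant
# class by default, since it is the first factor level)
F_1 <- sapply(cutoff, function(x) {
  y_hat <- ifelse(train_set$height > x, "Male", "Female") |>
    factor(levels = levels(train_set$sex))
  F_meas(data = y_hat, reference = train_set$sex)
})

# Cutoff that maximizes the F-score
best_cutoff <- cutoff[which.max(F_1)]
```

With the `best_cutoff` chosen this way, sensitivity and specificity on the test set are more balanced: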
```r
y_hat <- ifelse(test_set$height > best_cutoff, "Male", "Female") |>
  factor(levels = levels(test_set$sex))
sensitivity(data = y_hat, reference = test_set$sex)
#> [1] 0.633
specificity(data = y_hat, reference = test_set$sex)
#> [1] 0.857
```
Up to now we have described evaluation metrics that apply exclusively to categorical data.
Specifically, for binary outcomes, we have described how sensitivity, specificity, accuracy, and \(F_1\) can be used as quantification.
However, these metrics are not useful for continuous outcomes.
In this section, we describe how the general approach to defining “best” in machine learning is to define a loss function, which can be applied to both categorical and continuous data.
For continuous outcomes, the most commonly used loss function is the squared loss. The mean squared error (MSE) is defined as

\[ \text{MSE} \equiv \mbox{E}\{(\hat{Y} - Y)^2 \} \]

and, in practice, we estimate it with the average squared error on a test set with \(N\) observations:

\[ \hat{\mbox{MSE}} = \frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2 \]
Note: In practice, we often report the root mean squared error (RMSE), which is simply \(\sqrt{\mbox{MSE}}\), because it is in the same units as the outcomes.
The estimate \(\hat{\text{MSE}}\) is a random variable.
\(\text{MSE}\) and \(\hat{\text{MSE}}\) are often referred to as the true error and apparent error, respectively.
It is difficult to derive the statistical properties of how well the apparent error estimates the true error.
We later introduce cross-validation, an approach to estimating the MSE.
There are loss functions other than the squared loss.
For example, the mean absolute error uses the absolute values \(|\hat{y}_i - y_i|\) instead of the squared errors \((\hat{y}_i - y_i)^2\).
However, in this book we focus on minimizing squared loss since it is the most widely used.
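As a small illustration, here are minimal helper functions for these losses; the names `rmse` and `mae` are our own, not from any package.

```r
# Root mean squared error: same units as the outcome
rmse <- function(y_hat, y) sqrt(mean((y_hat - y)^2))

# Mean absolute error: uses |y_hat - y| instead of squared errors
mae <- function(y_hat, y) mean(abs(y_hat - y))

# Hypothetical usage with made-up vectors
y <- c(3.2, 4.1, 5.0)
y_hat <- c(3.0, 4.5, 4.8)
rmse(y_hat, y)
mae(y_hat, y)
```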