In machine learning applications, we rarely can predict outcomes perfectly.
The most common reason for not being able to build perfect algorithms is that, for most prediction problems, doing so is simply impossible.
To see this, consider that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes.
Because our prediction rules are functions, equal inputs (the predictors) imply equal outputs (the predictions).
Therefore, when the same predictor values are associated with different outcomes across observations, it is impossible to predict correctly for all of them.
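To make this concrete, here is a minimal sketch in Python with made-up numbers: two observations share the exact same predictor values but have different outcomes, so any rule that is a function of the predictors must miss at least one of them.

```python
# A minimal sketch with made-up numbers: two observations share the exact same
# predictor values but have different outcomes, so no prediction rule -- being
# a function of the predictors -- can get both right.
rows = [
    {"x1": 65, "x2": 150, "y": 0},
    {"x1": 65, "x2": 150, "y": 1},  # identical predictors, different outcome
    {"x1": 70, "x2": 180, "y": 1},
]

def predict(x1, x2):
    # Any deterministic rule: equal inputs necessarily give equal outputs.
    return 1 if x2 > 160 else 0

errors = sum(predict(r["x1"], r["x2"]) != r["y"] for r in rows)
print(errors)  # at least one of the two tied observations is always missed
```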
It therefore makes sense to consider conditional probabilities:
\[ \mbox{Pr}(Y=k \mid X_1 = x_1,\dots,X_p=x_p), \, \mbox{for}\,k=1,\dots,K \]
To simplify notation, we use bold letters to represent the vector of predictors and write

\[ p_k(\mathbf{x}) = \mbox{Pr}(Y=k \mid \mathbf{X}=\mathbf{x}), \, \mbox{for}\, k=1,\dots,K \]

When the outcome is binary, there is only one probability to keep track of, so we drop the index and write

\[p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x})\]
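As an illustration of what these quantities mean, here is a small sketch, using a hypothetical dataset with a single categorical predictor, that estimates \(\mbox{Pr}(Y=1 \mid X=x)\) by the proportion of ones among observations sharing the value \(x\).

```python
from collections import defaultdict

# A minimal sketch, assuming a hypothetical dataset with a single categorical
# predictor: estimate Pr(Y = 1 | X = x) by the proportion of ones among the
# observations that share the value x.
data = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0), ("b", 1)]

counts = defaultdict(lambda: [0, 0])  # x -> [number of ones, number of observations]
for x, y in data:
    counts[x][0] += y
    counts[x][1] += 1

p_hat = {x: ones / n for x, (ones, n) in counts.items()}
print(p_hat)  # {'a': 0.666..., 'b': 0.333...}
```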
Note that \(p(\mathbf{x})\) denotes a conditional probability as a function of \(\mathbf{x}\); it should not be confused with \(p\), the number of predictors. If we knew these conditional probabilities, prediction would be straightforward: for any set of predictors \(\mathbf{x}\), we would predict the class \(k\) with the largest probability,

\[\hat{Y} = \mathop{\mbox{arg max}}_k \, p_k(\mathbf{x})\]
In machine learning, we refer to this as Bayes’ Rule.
But this is a theoretical rule since, in practice, we don’t know \(p_k(\mathbf{x}), k=1,\dots,K\).
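To illustrate the rule itself, here is a minimal sketch that applies it under the assumption that we somehow knew the conditional probabilities; the hard-coded values for a three-class problem are purely hypothetical.

```python
# A minimal sketch of the rule above, assuming we somehow knew the conditional
# probabilities p_k(x) (hard-coded hypothetical values for a three-class
# problem at one value of x): predict the class k with the largest probability.
def bayes_rule(p_k):
    # p_k maps each class k to p_k(x); return the class with the maximum probability.
    return max(p_k, key=p_k.get)

p_at_x = {1: 0.2, 2: 0.7, 3: 0.1}  # hypothetical p_k(x) for one x
print(bayes_rule(p_at_x))  # 2
```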
Estimating these conditional probabilities can be thought of as the main challenge of machine learning.
The better our probability estimates \(\hat{p}_k(\mathbf{x})\), the better our predictor \(\hat{Y}\).
How well we predict depends on two things: 1) how close \(\max_k p_k(\mathbf{x})\) is to 1, that is, how much irreducible uncertainty there is in the outcome, and 2) how close our estimates \(\hat{p}_k(\mathbf{x})\) are to the true conditional probabilities \(p_k(\mathbf{x})\).
We can’t do anything about the first restriction as it is determined by the nature of the problem, so our energy goes into finding ways to best estimate conditional probabilities.
When the outcome is binary and coded as 0 or 1, the conditional expectation is equivalent to the conditional probability:

\[ \mbox{E}(Y \mid \mathbf{X}=\mathbf{x})=\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x}). \]
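A quick numeric check of this equivalence, using made-up outcomes:

```python
# A quick numeric check with made-up outcomes: when Y is coded 0/1, the average
# of the outcomes among observations with X = x (an estimate of E(Y | X = x))
# is exactly the proportion of ones (an estimate of Pr(Y = 1 | X = x)).
y_given_x = [1, 0, 1, 1, 0]                     # outcomes observed at X = x
average = sum(y_given_x) / len(y_given_x)
proportion_of_ones = y_given_x.count(1) / len(y_given_x)
print(average, proportion_of_ones)  # 0.6 0.6
```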
Why do we care about the conditional expectation in machine learning?
This is because the expected value has an attractive mathematical property: it minimizes the mean squared error (MSE). Specifically, of all possible predictions \(\hat{Y}\),
\[ \hat{Y} = \mbox{E}(Y \mid \mathbf{X}=\mathbf{x}) \, \mbox{ minimizes } \, \mbox{E}\{ (\hat{Y} - Y)^2 \mid \mathbf{X}=\mathbf{x} \} \]
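The following small illustration, with made-up outcomes for a single value of \(\mathbf{x}\), searches over a grid of constant predictions and confirms that the average of the outcomes gives the smallest average squared error:

```python
# A small numeric illustration with made-up outcomes: among all constant
# predictions c for observations with X = x, the average squared error
# (1/n) * sum_i (c - y_i)^2 is smallest when c equals the mean of the outcomes.
y = [1, 0, 1, 1, 0, 1]                 # outcomes observed at X = x
mean_y = sum(y) / len(y)               # 0.666...

def mse(c):
    return sum((c - yi) ** 2 for yi in y) / len(y)

grid = [i / 100 for i in range(101)]   # candidate constants 0.00, 0.01, ..., 1.00
best = min(grid, key=mse)
print(mean_y, best)                    # the grid minimizer is 0.67, right at the mean
```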
Due to this property, a succinct description of the main task of machine learning is that we use data to estimate

\[ f(\mathbf{x}) \equiv \mbox{E}( Y \mid \mathbf{X}=\mathbf{x} ) \]
for any set of features \(\mathbf{x} = (x_1, \dots, x_p)^\top\).
This is easier said than done, since this function can take any shape and \(p\) can be very large.