In machine learning applications, we rarely can predict outcomes perfectly.
The most common reason for not being able to build perfect algorithms is that, for most prediction problems, doing so is simply impossible.
To see this, consider that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes.
Because our prediction rules are functions, equal inputs (the predictors) imply equal outputs (the predictions).
Therefore, when the same predictor values are associated with different outcomes across observations, it is impossible to predict correctly for all of them.
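To make this concrete, here is a minimal sketch in Python with made-up numbers: two observations share the exact same predictor values but have different outcomes, so any rule that is a function of the predictors must miss at least one of them.

```python
# A minimal sketch with made-up numbers: two observations share the exact same
# predictor values but have different outcomes, so no prediction rule -- being
# a function of the predictors -- can get both right.
rows = [
    {"x1": 65, "x2": 150, "y": 0},
    {"x1": 65, "x2": 150, "y": 1},  # identical predictors, different outcome
    {"x1": 70, "x2": 180, "y": 1},
]

def predict(x1, x2):
    # Any deterministic rule: equal inputs necessarily give equal outputs.
    return 1 if x2 > 160 else 0

errors = sum(predict(r["x1"], r["x2"]) != r["y"] for r in rows)
print(errors)  # at least one of the two tied observations is always missed
```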
It therefore makes sense to consider conditional probabilities:
\[ \mbox{Pr}(Y=k \mid X_1 = x_1,\dots,X_p=x_p), \, \mbox{for}\,k=1,\dots,K \]
To simplify notation, we use bold letters to represent the vector of predictors and write

\[ p_k(\mathbf{x}) = \mbox{Pr}(Y=k \mid \mathbf{X}=\mathbf{x}), \, \mbox{for}\, k=1,\dots,K \]

When the outcome is binary, there is only one probability to keep track of, so we drop the index and write

\[p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x})\]
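As an illustration of what these quantities mean, here is a small sketch, using a hypothetical dataset with a single categorical predictor, that estimates \(\mbox{Pr}(Y=1 \mid X=x)\) by the proportion of ones among observations sharing the value \(x\).

```python
from collections import defaultdict

# A minimal sketch, assuming a hypothetical dataset with a single categorical
# predictor: estimate Pr(Y = 1 | X = x) by the proportion of ones among the
# observations that share the value x.
data = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0), ("b", 1)]

counts = defaultdict(lambda: [0, 0])  # x -> [number of ones, number of observations]
for x, y in data:
    counts[x][0] += y
    counts[x][1] += 1

p_hat = {x: ones / n for x, (ones, n) in counts.items()}
print(p_hat)  # {'a': 0.666..., 'b': 0.333...}
```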
Note that \(p(\mathbf{x})\) denotes a conditional probability as a function of \(\mathbf{x}\); it should not be confused with \(p\), the number of predictors. If we knew these conditional probabilities, prediction would be straightforward: for any set of predictors \(\mathbf{x}\), we would predict the class \(k\) with the largest probability,

\[\hat{Y} = \mathop{\mbox{arg max}}_k \, p_k(\mathbf{x})\]
In machine learning, we refer to this as Bayes’ Rule.
But this is a theoretical rule since, in practice, we don’t know \(p_k(\mathbf{x}), k=1,\dots,K\).
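To illustrate the rule itself, here is a minimal sketch that applies it under the assumption that we somehow knew the conditional probabilities; the hard-coded values for a three-class problem are purely hypothetical.

```python
# A minimal sketch of the rule above, assuming we somehow knew the conditional
# probabilities p_k(x) (hard-coded hypothetical values for a three-class
# problem at one value of x): predict the class k with the largest probability.
def bayes_rule(p_k):
    # p_k maps each class k to p_k(x); return the class with the maximum probability.
    return max(p_k, key=p_k.get)

p_at_x = {1: 0.2, 2: 0.7, 3: 0.1}  # hypothetical p_k(x) for one x
print(bayes_rule(p_at_x))  # 2
```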
Estimating these conditional probabilities can be thought of as the main challenge of machine learning.
The better our probability estimates \(\hat{p}_k(\mathbf{x})\), the better our predictor \(\hat{Y}\).
How well we predict depends on two things: 1) how close \(\max_k p_k(\mathbf{x})\) is to 1, that is, how much irreducible uncertainty there is in the outcome, and 2) how close our estimates \(\hat{p}_k(\mathbf{x})\) are to the true conditional probabilities \(p_k(\mathbf{x})\).
We can’t do anything about the first restriction as it is determined by the nature of the problem, so our energy goes into finding ways to best estimate conditional probabilities.
When the outcome is binary and coded as 0 or 1, the conditional expectation is equivalent to the conditional probability:

\[ \mbox{E}(Y \mid \mathbf{X}=\mathbf{x})=\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x}). \]
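A quick numeric check of this equivalence, using made-up outcomes:

```python
# A quick numeric check with made-up outcomes: when Y is coded 0/1, the average
# of the outcomes among observations with X = x (an estimate of E(Y | X = x))
# is exactly the proportion of ones (an estimate of Pr(Y = 1 | X = x)).
y_given_x = [1, 0, 1, 1, 0]                     # outcomes observed at X = x
average = sum(y_given_x) / len(y_given_x)
proportion_of_ones = y_given_x.count(1) / len(y_given_x)
print(average, proportion_of_ones)  # 0.6 0.6
```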
Why do we care about the conditional expectation in machine learning?
This is because the expected value has an attractive mathematical property: it minimizes the mean squared error (MSE). Specifically, of all possible predictions \(\hat{Y}\),
\[ \hat{Y} = \mbox{E}(Y \mid \mathbf{X}=\mathbf{x}) \, \mbox{ minimizes } \, \mbox{E}\{ (\hat{Y} - Y)^2 \mid \mathbf{X}=\mathbf{x} \} \]
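The following small illustration, with made-up outcomes for a single value of \(\mathbf{x}\), searches over a grid of constant predictions and confirms that the average of the outcomes gives the smallest average squared error:

```python
# A small numeric illustration with made-up outcomes: among all constant
# predictions c for observations with X = x, the average squared error
# (1/n) * sum_i (c - y_i)^2 is smallest when c equals the mean of the outcomes.
y = [1, 0, 1, 1, 0, 1]                 # outcomes observed at X = x
mean_y = sum(y) / len(y)               # 0.666...

def mse(c):
    return sum((c - yi) ** 2 for yi in y) / len(y)

grid = [i / 100 for i in range(101)]   # candidate constants 0.00, 0.01, ..., 1.00
best = min(grid, key=mse)
print(mean_y, best)                    # the grid minimizer is 0.67, right at the mean
```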
Due to this property, a succinct description of the main task of machine learning is that we use data to estimate

\[ f(\mathbf{x}) \equiv \mbox{E}( Y \mid \mathbf{X}=\mathbf{x} ) \]
for any set of features \(\mathbf{x} = (x_1, \dots, x_p)^\top\).
This is easier said than done, since this function can take any shape and \(p\) can be very large.