Discriminant Analysis

2021-05-02 | Modified: 2021-05-02

Linear Discriminant Classifier

Suppose we have two classes of data \(P_1\) and \(P_2\), corresponding to genuine and counterfeit currency, with class-conditional densities \(f_j(\mathbf x)\), \(j \in \{1, 2\}\). Let \(Y\) denote the class label, \(Y = 1\) or \(Y = 2\). We also assume prior probabilities of class membership \(\mathbb P(Y = 1) = \pi_1\) and \(\mathbb P(Y = 2) = \pi_2\), with \(\pi_1 + \pi_2 = 1\).

Given that we have observed \(\mathbf x\), what is the probability that it belongs to class 1?

$$\mathbb P(Y = 1 \mid X = \mathbf x) = \frac{\mathbb P(Y = 1 \cap X = \mathbf x)}{\mathbb P(X = \mathbf x)} = \frac{f_1(\mathbf x)\pi_1}{f_1(\mathbf x)\pi_1 + f_2(\mathbf x)\pi_2}$$

Thus Bayes' rule classifies \(\mathbf x\) to the class with the highest posterior probability:

$$\varphi(\mathbf x) = \arg \underset{j\in\{1, 2\}}\max \mathbb P(Y = j \mid X = \mathbf x).$$

In other words, we assign \(\mathbf x\) to class 1 if \(\mathbb P(Y = 1 \mid X = \mathbf x) > \mathbb P(Y = 2 \mid X = \mathbf x)\), or equivalently if \(f_1(\mathbf x)\pi_1 > f_2(\mathbf x)\pi_2\).
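
Here is a minimal numeric sketch of this rule, assuming one-dimensional Gaussian class densities; the means, scale, and priors below are illustrative assumptions, not values from the text.

```python
# Two-class Bayes rule with known densities; parameter values are
# illustrative assumptions only.
from scipy.stats import norm

pi1, pi2 = 0.7, 0.3            # priors, pi1 + pi2 = 1
f1 = norm(loc=0.0, scale=1.0)  # density of class 1 (genuine)
f2 = norm(loc=2.0, scale=1.0)  # density of class 2 (counterfeit)

x = 0.8
# Posterior P(Y = 1 | X = x) via Bayes' theorem.
posterior1 = f1.pdf(x) * pi1 / (f1.pdf(x) * pi1 + f2.pdf(x) * pi2)
# Classify to class 1 when f1(x) * pi1 > f2(x) * pi2.
label = 1 if f1.pdf(x) * pi1 > f2.pdf(x) * pi2 else 2
print(posterior1, label)
```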

Assuming the populations follow normal distributions with different means but equal covariance, \(X \mid Y = j \sim \mathcal N(\mu_j, \Sigma)\), we then assign \(\mathbf x\) to class 1 if

$$\begin{aligned} &\frac{f_1(\mathbf x)}{f_2(\mathbf x)} > \frac{\pi_2}{\pi_1} \\ &\Leftrightarrow \log \frac{f_1(\mathbf x)}{f_2(\mathbf x)} + \log \frac{\pi_1}{\pi_2} > 0 \\ &\Leftrightarrow \log \frac{\pi_1}{\pi_2} + (\mu_1 - \mu_2)^T\Sigma^{-1}\mathbf x - \frac{1}{2} (\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 + \mu_2) > 0, \end{aligned}$$

where the last line follows from expanding the Gaussian log-densities: the quadratic terms \(\mathbf x^T\Sigma^{-1}\mathbf x\) cancel because the covariance is shared, leaving a rule that is linear in \(\mathbf x\).
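
The linear rule translates directly into code. Below is a sketch assuming the true parameters are known; the values of \(\mu_1\), \(\mu_2\), \(\Sigma\), and the priors are illustrative only.

```python
import numpy as np

mu1 = np.array([0.0, 0.0])
mu2 = np.array([1.5, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
pi1, pi2 = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)   # coefficients of the linear rule
b = np.log(pi1 / pi2) - 0.5 * (mu1 - mu2) @ Sigma_inv @ (mu1 + mu2)

x = np.array([0.4, 0.2])
score = w @ x + b             # assign to class 1 when score > 0
print(1 if score > 0 else 2)
```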

In practice the true means and covariance of the population distributions are not known. Instead we use the sample statistics: \(\hat \mu_j = \bar{\mathbf x}_j\) for the mean and \(\hat \Sigma_j = \sum_i (\mathbf x_i - \bar{\mathbf x}_j)(\mathbf x_i - \bar{\mathbf x}_j)^T \,/\, (n_j - 1)\) for the covariance; under the equal-covariance assumption the per-class estimates are pooled into a single \(\hat \Sigma\). It is also common to estimate the prior probabilities as \(\hat \pi_j = n_j \,/\, N\), where \(n_j\) is the number of observations in class \(j\).
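
A sketch of these plug-in estimates, assuming `X1` and `X2` are \((n_j, p)\) arrays of observations for the two classes (simulated here for illustration):

```python
import numpy as np

def fit_lda(X1, X2):
    n1, n2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled covariance: combine the per-class scatter matrices.
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sigma = (S1 + S2) / (n1 + n2 - 2)
    # Priors estimated as class proportions, pi_j = n_j / N.
    pi1, pi2 = n1 / (n1 + n2), n2 / (n1 + n2)
    return mu1, mu2, Sigma, pi1, pi2

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(100, 2))
X2 = rng.normal(1.5, 1.0, size=(80, 2))
print(fit_lda(X1, X2))
```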

Relaxing Constraints and Generalizing

So far we have looked at the case of \(K = 2\) classes. For \(K > 2\), the probability that observation \(\mathbf x\) belongs to class \(j\) is

$$\mathbb P(Y = j \mid X = \mathbf x) = \frac{f_j(\mathbf x)\pi_j}{\sum_{k=1}^K f_k(\mathbf x)\pi_k}.$$
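
A sketch of the \(K\)-class posterior, assuming Gaussian class densities with a shared covariance; the means and priors below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors(x, mus, Sigma, priors):
    dens = np.array([multivariate_normal.pdf(x, mean=m, cov=Sigma)
                     for m in mus])
    unnorm = dens * priors        # f_k(x) * pi_k for each class
    return unnorm / unnorm.sum()  # normalize to get P(Y = k | x)

mus = [np.zeros(2), np.array([2.0, 0.0]), np.array([0.0, 2.0])]
priors = np.array([0.5, 0.25, 0.25])
p = posteriors(np.array([1.0, 1.0]), mus, np.eye(2), priors)
print(p, p.argmax() + 1)          # Bayes rule picks the argmax
```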

If we drop the assumption that the populations share a covariance matrix, so \(\Sigma_j \neq \Sigma_k\), the quadratic terms no longer cancel and we get Quadratic Discriminant Analysis (QDA), whose decision boundaries are quadratic in \(\mathbf x\). The Gaussian assumption can also be replaced with non-parametric density estimates to obtain non-linear decision boundaries, and Naive Bayes simplifies the densities further by assuming the features are conditionally independent given the class.
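
For comparison with the linear rule, here is a sketch of the QDA log-discriminant, which keeps a separate covariance per class; parameter values are again illustrative assumptions.

```python
import numpy as np

def qda_score(x, mu, Sigma, prior):
    # log pi_j - 1/2 log|Sigma_j| - 1/2 (x - mu_j)^T Sigma_j^{-1} (x - mu_j)
    diff = x - mu
    return (np.log(prior)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * diff @ np.linalg.solve(Sigma, diff))

x = np.array([0.5, 0.5])
classes = [(np.zeros(2), np.eye(2), 0.6),
           (np.ones(2), 2.0 * np.eye(2), 0.4)]
scores = [qda_score(x, mu, Sigma, prior) for mu, Sigma, prior in classes]
print(np.argmax(scores) + 1)   # class with the largest score wins
```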