stat841f11
Contents
- 1 Proposal for Final Project
- 2 Editor Sign Up
- 3 STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20
- 3.1 Wiki Course Notes
- 3.2 Classification (Lecture: Sep. 20, 2011)
- 3.3 LDA and QDA
- 3.4 LDA and QDA Continued (Lecture: Sep. 22, 2011)
- 3.5 Trick: Using LDA to do QDA
- 3.6 Principal Component Analysis (PCA) (Lecture: Sep. 27, 2011)
- 3.7 Principal Component Analysis (PCA) Continued (Lecture: Sep. 29, 2011)
- 3.8 Fisher Discriminant Analysis (FDA) (Lecture: Sep. 29, 2011)
- 3.9 Fisher Discriminant Analysis (FDA) Continued (Lecture: Oct. 04, 2011)
- 3.10 Linear and Logistic Regression (Lecture: Oct. 06, 2011)
- 3.11 Newton-Raphson Method (Lecture: Oct 11, 2011)
- 3.12 Perceptron
- 3.13 Perceptron Learning Algorithm (Lecture: Oct. 13, 2011)
- 3.14 Feed-Forward Neural Networks
- 3.15 Continue back propagation Oct 18, 2011 lecture
- 3.16 References
Proposal for Final Project
Editor Sign Up
STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20
Wiki Course Notes
Students will need to contribute to the wiki for 20% of their grade. Access via wikicoursenote.com Go to editor sign-up, and use your UW userid for your account name, and use your UW email.
primary (10%) Post a draft of lecture notes within 48 hours. You will need to do this 1 or 2 times, depending on class size.
secondary (10%) Make improvements to the notes for at least 60% of the lectures. More than half of your contributions should be technical rather than editorial. There will be a spreadsheet where students can indicate what they've done and when. The instructor will conduct random spot checks to ensure that students have contributed what they claim.
Classification (Lecture: Sep. 20, 2011)
Introduction
Machine learning (ML) methodology in general is an artificial intelligence approach to establish and train a model to recognize the pattern or underlying mapping of a system based on a set of training examples consisting of input and output patterns.
In machine learning, pattern recognition is the assignment of some sort of output value (or label) to a given input value (or instance), according to some specific algorithm. The approach of using examples to produce the output labels is known as learning methodology. When the underlying function from inputs to outputs exists, it is referred to as the target function. The estimate of the target function which is learnt or output by the learning algorithm is known as the solution of learning problem. In case of classification this function is referred to as the decision function.
In the broadest sense, any method that incorporates information from training samples in the design of a classifier employs learning. Learning tasks can be classified along different dimensions. One important dimension is the distinction between supervised and unsupervised learning. In supervised learning a category label for each pattern in the training set is provided. The trained system will then generalize to new data samples. In unsupervised learning , on the other hand, training data has not been labeled, and the system forms clusters or natural grouping of input patterns based on some sort of measure of similarity and it can then be used to determine the correct output value for new data instances.
The first category is known as pattern classification and the second one as clustering. Pattern classification is the main focus in this course.
Classification problem formulation : Suppose that we are given n observations. Each observation consists of a pair: a vector [math]\mathbf{x}_i\subset \mathbb{R}^d, \quad i=1,...,n[/math], and the associated label [math]y_i[/math].
Where [math]\mathbf{x}_i = (x_{i1}, x_{i2}, ... x_{id}) \in \mathcal{X} \subset \mathbb{R}^d[/math] and [math]Y_i[/math] belongs to some finite set [math]\mathcal{Y}[/math].
The classification task is now looking for a function [math]f:\mathbf{x}_i\mapsto y[/math] which maps the input data points to a target value (i.e. class label). Function [math]f(\mathbf{x},\theta)[/math] is defined by a set of parametrs [math]\mathbf{\theta}[/math] and the goal is to train the classifier in a way that among all possible mappings with different parameters the obtained decision boundary gives the minimum classification error.
Definitions
true error rate for classifier [math]h[/math] is the error with respect to the underlying distribution (that we do not know).
[math]L(h) = P(h(X) \neq Y )[/math]
empirical error rate (or training error rate) is the amount of error that our classification function [math]h(x)[/math] makes on the training data.
[math]\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)[/math]
where [math]\mathbf{I}()[/math] is an indicator function. Indicator function is defined by
[math]\mathbf{I}(x) = \begin{cases} 1 & \text{if } x \text{ is true} \\ 0 & \text{if } x \text{ is false} \end{cases}[/math]
So in this case, [math]\mathbf{I}(h(X_i)\neq Y_i) = \begin{cases} 1 & \text{if } h(X_i)\neq Y_i \text{ (i.e. when misclassification happens)} \\ 0 & \text{if } h(X_i)=Y_i \text{ (i.e. classified properly)} \end{cases}[/math]
e.g., 100 new data points with known (true) labels
[math]y_1 = h(x_1)[/math]
...
[math]y_{100} = h(x_{100})[/math]
To calculate the empirical error we count how many labels our function [math]h(x)[/math] assigned incorrectly and divide by n=100
Bayes Classifier
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability<ref> http://www.wikicoursenote.com/wiki/Stat841#Bayes_Classifier </ref>.
First recall Bayes' Rule, in the format [math]P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} [/math]
P(Y|X) : posterior , probability of [math]Y[/math] given [math]X[/math]
P(X|Y) : likelihood, probability of [math]X[/math] being generated by [math]Y[/math]
P(Y) : prior, probability of [math]Y[/math] being selected
P(X) : marginal, probability of obtaining [math]X[/math]
We will start with the simplest case: [math]\mathcal{Y} = \{0,1\}[/math]
[math] r(x) = P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)} = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}[/math]
Bayes' rule can be approached by computing either:
1) The posterior: [math]\ P(Y=1|X=x) [/math] and [math]\ P(Y=0|X=x) [/math] or
2) The likelihood: [math]\ P(X=x|Y=1) [/math] and [math]\ P(X=x|Y=0) [/math]
The former reflects a Bayesian approach. The Bayesian approach uses previous beliefs and observed data (e.g., the random variable [math]\ X [/math]) to determine the probability distribution of the parameter of interest (e.g., the random variable [math]\ Y [/math]). The probability, according to Bayesians, is a degree of belief in the parameter of interest taking on a particular value (e.g., [math]\ Y=1 [/math]), given a particular observation (e.g., [math]\ X=x [/math]). Historically, the difficulty in this approach lies with determining the posterior distribution, however, more recent methods such as Markov Chain Monte Carlo (MCMC) allow the Bayesian approach to be implemented <ref name="PCAustin">P. C. Austin, C. D. Naylor, and J. V. Tu, "A comparison of a Bayesian vs. a frequentist method for profiling hospital performance," Journal of Evaluation in Clinical Practice, 2001</ref>.
The latter reflects a Frequentist approach. The Frequentist approach assumes that the probability distribution, including the mean, variance, etc., is fixed for the parameter of interest (e.g., the variable [math]\ Y [/math], which is not random). The observed data (e.g., the random variable [math]\ X [/math]) is simply a sampling of a far larger population of possible observations. Thus, a certain repeatability or frequency is expected in the observed data. If it were possible to make an infinite number of observations, then the true probability distribution of the parameter of interest can be found. In general, frequentists use a technique called hypothesis testing to compare a null hypothesis (e.g. an assumption that the mean of the probability distribution is [math]\ \mu_0 [/math]) to an alternative hypothesis (e.g. assuming that the mean of the probability distribution is larger than [math]\ \mu_0 [/math]) <ref name="PCAustin"/>. For more information on hypothesis testing see <ref>R. Levy, "Frequency hypothesis testing, and contingency tables" class notes for LING251, Department of Linguistics, University of California, 2007. Available: http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf </ref>.
There was some class discussion on which approach should be used. Both the ease of computation and the validity of both approaches were discussed. A main point that was brought up in class is that Frequentists consider X to be a random variable, but they do not consider Y to be a random variable because it has to take on one of the values from a fixed set (in the above case it would be either 0 or 1 and there is only one correct label for a given value X=x). Thus, from a Frequentist's perspective it does not make sense to talk about the probability of Y. This is actually a grey area and sometimes Bayesians and Frequentists use each others' approaches. So using Bayes' rule doesn't necessarily mean you're a Bayesian. Overall, the question remains unresolved.
The Bayes Classifier uses [math]\ P(Y=1|X=x)[/math]
[math] P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}[/math]
P(Y=1) : the prior, based on belief/evidence beforehand
denominator : marginalized by summation
[math]h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} [/math]
The set [math]\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \} [/math]
which defines a decision boundary.
[math]h^*(x) = \begin{cases} 1 \ \ if \ \ P(Y=1|X=x) \gt P(Y=0|X=x) \\ 0 \ \ \ \ \ \ otherwise \end{cases} [/math]
Theorem: Bayes rule is optimal, i.e., if [math]h[/math] is any other classification rule, then [math]L(h^*) \lt = L(h)[/math]
Proof: Consider any classifier [math]h[/math]. We can express the error rate as
- [math] P( \{h(X) \ne Y \} ) = E_{X,Y} [ \mathbf{1}_{\{h(X) \ne Y \}} ] = E_X \left[ E_Y[ \mathbf{1}_{\{h(X) \ne Y \}}| X] \right] [/math]
To minimize this last expression, it suffices to minimize the inner expectation. Expanding this expectation:
- [math] E_Y[ \mathbf{1}_{\{h(X) \ne Y \}}| X] = \sum_{y \in Supp(Y)} P( h(X) \ne y | X) \mathbf{1}_{\{h(X) \ne y \} } [/math]
which, in the two-class case, simplifies to
- [math] = P( h(X) \ne 0 | X) \mathbf{1}_{\{h(X) \ne 0 \} } + P( h(X) \ne 1 | X) \mathbf{1}_{\{h(X) \ne 1 \} } [/math]
- [math] = r(X) \mathbf{1}_{\{h(X) \ne 0 \} } + (1-r(X))\mathbf{1}_{\{h(X) \ne 1 \} } [/math]
where [math]r(x)[/math] is defined as above. We should 'choose' h(X) to equal the label that minimizes the sum. Consider if [math]r(X)\gt 1/2 [/math], then [math]r(X)\gt 1-r(X)[/math] so we should let [math]h(X) = 1[/math] to minimize the sum. Thus the Bayes classifier is the optimal classifier.
Why then do we need other classification methods? A: Because X densities are often/typically unknown. I.e., [math]f_k(x)[/math] and/or [math]\pi_k[/math] unknown.
[math]P(Y=k|X=x) = \frac{P(X=x|Y=k)P(Y=k)} {P(X=x)} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}[/math] f_k(x) is referred to as the class conditional distribution (~likelihood).
Therefore, we rely on some data to estimate quantities.
Three Main Approaches
1. Empirical Risk Minimization: Choose a set of classifiers H (e.g., line, neural network) and find [math]h^* \in H[/math] that minimizes (some estimate of) L(h).
2. Regression: Find an estimate ([math]\hat{r}[/math]) of function [math]r[/math] and define [math]h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} [/math]
The [math] 1/2 [/math] in the expression above is a threshold set for the regression prediction output.
In general regression refers to finding a continuous, real valued y. The problem here is more difficult, because of the restricted domain (y is a set of discrete label values).
3. Density Estimation: Estimate [math]P(X=x|Y=0)[/math] from [math]X_i[/math]'s for which [math]Y_i = 0[/math] Estimate [math]P(X=x|Y=1)[/math] from [math]X_i[/math]'s for which [math]Y_i = 1[/math] and let [math]P(Y=?) = (1/n) \sum_{i=1}^{n} Y_i[/math]
Define [math]\hat{r}(x) = \hat{P}(Y=1|X=x)[/math] and [math]h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} [/math]
It is possible that there may not be enough data to estimate from for density estimation. But the main problem lies with high dimensional spaces, as the estimation results may not be good (high error rate) and sometimes even infeasible. The term curse of dimensionality was coined by Bellman <ref>R. E. Bellman, Dynamic Programming. Princeton University Press, 1957</ref> to describe this problem.
As the dimension of the space goes up, the learning requirements go up exponentially.
To Learn more about methods for handling high-dimensional data <ref> https://docs.google.com/viewer?url=http%3A%2F%2Fwww.bios.unc.edu%2F~dzeng%2FBIOS740%2Flecture_notes.pdf</ref>
Multi-Class Classification
Generalize to case Y takes on k>2 values.
Theorem: [math]Y \in \mathcal{Y} = \{1,2,..., k\} [/math] optimal rule
[math]\ h^{*}(x) = argmax_k P(Y=k|X=x) [/math]
where [math]P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum_r f_r \pi_r}[/math]
Examples of Classification
- Face detection in images.
- Medical diagnosis.
- Detecting credit card fraud (fraudulent or legitimate).
- Speech recognition.
- Handwriting recognition.
LDA and QDA
Discriminant function analysis finds features that best allow discrimination between two or more classes. The approach is similar to analysis of Variance (ANOVA) in that discriminant function analysis looks at the mean values to determine if two or more classes are very different and should be separated. Once the discriminant functions (that separate two or more classes) have been determined, new data points can be classified (i.e. placed in one of the classes) based on the discriminant functions <ref> StatSoft, Inc. (2011). Electronic Statistics Textbook. [Online]. Available: http://www.statsoft.com/textbook/discriminant-function-analysis/. </ref>. Linear discriminant analysis (LDA) and Quadratic discriminant analysis (QDA) are methods of discriminant analysis that are best applied to linearly and quadradically separable classes, respectively. Fisher discriminant analysis (FDA) another method of discriminant analysis that is different from linear discriminant analysis, but oftentimes both terms are used interchangeably.
LDA
The simplest method is to use approach 3 (above) and assume a parametric model for densities. Assume class conditional is Gaussian.
[math]\mathcal{Y} = \{ 0,1 \}[/math] assumed (i.e., 2 labels)
[math]h(x) = \begin{cases} 1 \ \ P(Y=1|X=x) \gt P(Y=0|X=x) \\ 0 \ \ otherwise \end{cases} [/math]
[math]P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \ [/math] (denom = P(x))
1) Assume Gaussian distributions
[math]f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \text{exp}\big(-\frac{1}{2}(\mathbf{x-\mu_k}) \Sigma_k^{-1}(\mathbf{x-\mu_k}) )[/math]
must compare [math]\frac{f_1(x) \pi_1} {p(x)}[/math] with [math]\frac{f_0(x) \pi_0} {p(x)}[/math] Note that the p(x) denom can be ignored: [math]f_1(x) \pi_1[/math] with [math]f_0(x) \pi_0 [/math]
To find the decision boundary, set [math]f_1(x) \pi_1 = f_0(x) \pi_0 [/math]
[math] \frac{1}{(2\pi)^{d/2} |\Sigma_1|^{1/2}} exp(-\frac{1}{2}(\mathbf{x - \mu_1}) \Sigma_1^{-1}(\mathbf{x-\mu_1}) )\pi_1 = \frac{1}{(2\pi)^{d/2} |\Sigma_0|^{1/2}} exp(-\frac{1}{2}(\mathbf{x -\mu_0}) \Sigma_0^{-1}(\mathbf{x-\mu_0}) )\pi_0[/math]
2) Assume [math]\Sigma_1 = \Sigma_0[/math], we can use [math]\Sigma = \Sigma_0 = \Sigma_1[/math].
[math] \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} exp(-\frac{1}{2}(\mathbf{x -\mu_1}) \Sigma^{-1}(\mathbf{x-\mu_1}) )\pi_1 = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} exp(-\frac{1}{2}(\mathbf{x- \mu_0}) \Sigma^{-1}(\mathbf{x-\mu_0}) )\pi_0[/math]
3) Cancel [math](2\pi)^{-d/2} |\Sigma|^{-1/2}[/math] from both sides.
[math] exp(-\frac{1}{2}(\mathbf{x - \mu_1}) \Sigma^{-1}(\mathbf{x-\mu_1}) )\pi_1 = exp(-\frac{1}{2}(\mathbf{x - \mu_0}) \Sigma^{-1}(\mathbf{x-\mu_0}) )\pi_0[/math]
4) Take log of both sides.
[math] -\frac{1}{2}(\mathbf{x - \mu_1}) \Sigma^{-1}(\mathbf{x-\mu_1}) )+ \text{log}(\pi_1) = -\frac{1}{2}(\mathbf{x - \mu_0}) \Sigma^{-1}(\mathbf{x-\mu_0}) )+ \text{log}(\pi_0)[/math]
5) Subtract one side from both sides, leaving zero on one side.
[math]-\frac{1}{2}(\mathbf{x - \mu_1})^T \Sigma^{-1} (\mathbf{x-\mu_1}) + \text{log}(\pi_1) - [-\frac{1}{2}(\mathbf{x - \mu_0})^T \Sigma^{-1} (\mathbf{x-\mu_0}) + \text{log}(\pi_0)] = 0 [/math]
[math]\frac{1}{2}[-\mathbf{x}^T \Sigma^{-1}\mathbf{x - \mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x}
+ \mathbf{x}^T \Sigma^{-1}\mathbf{x} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ]
+ \text{log}(\frac{\pi_1}{\pi_0}) = 0 [/math]
Cancelling out the terms quadratic in [math]\mathbf{x}[/math] and rearranging results in
[math]\frac{1}{2}[-\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} + (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_0}^T \Sigma^{-1}) \mathbf{x}] + \text{log}(\frac{\pi_1}{\pi_0}) = 0 [/math]
We can see that the first pair of terms is constant, and the second pair is linear in x.
Therefore, we end up with something of the form
[math]ax + b = 0[/math].
For more about LDA <ref>http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf</ref>
LDA and QDA Continued (Lecture: Sep. 22, 2011)
If we relax assumption 2 (i.e. [math]\Sigma_1 \neq \Sigma_0[/math]) then we get a quadratic equation that can be written as [math]{x}^Ta{x}+b{x} + c = 0[/math]
Generalizing LDA and QDA
Theorem:
Suppose that [math]\,Y \in \{1,\dots,K\}[/math], if [math]\,f_k(x) = Pr(X=x|Y=k)[/math] is Gaussian, the Bayes Classifier rule is
- [math]\,h^*(x) = \arg\max_{k} \delta_k(x)[/math]
Where
[math] \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) [/math]
When the Gaussian variances are equal [math]\Sigma_1 = \Sigma_0[/math] (e.g. LDA), then
[math] \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) [/math]
(To compute this, we need to calculate the value of [math]\,\delta [/math] for each class, and then take the one with the max. value).
In practice
We estimate the prior to be the chance that a random item from the collection belongs to class k, e.g.
[math]\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}[/math]
The mean to be the average item in set k, e.g.
[math]\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i[/math]
and calculate the covariance of each class e.g.
[math]\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top[/math]
If we wish to use LDA we must calculate a common covariance, so we average all the covariances e.g.
[math]\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{r=1}^{k}n_r} [/math]
Where: [math]\,n_r[/math] is the number of data points in class [math]\,r[/math], [math]\,\Sigma_r[/math] is the covariance of class [math]\,r[/math], [math]\,n[/math] is the total number of data points, and [math]\,k[/math] is the number of classes.
Computation
For QDA we need to calculate: [math] \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) [/math]
Lets first consider when [math]\, \Sigma_k = I, \forall k [/math]. This is the case where each distribution is spherical, around the mean point.
Case 1
When [math]\, \Sigma_k = I [/math]
We have:
[math] \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) [/math]
but [math]\ \log(|I|)=\log(1)=0 [/math]
and [math]\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) [/math] is the squared Euclidean distance between two points [math]\,x[/math] and [math]\,\mu_k[/math]
Thus in this condition, a new point can be classified by its distance away from the center of a class, adjusted by some prior.
Further, for two-class problem with equal prior, the discriminating function would be the bisector of the 2-class's means.
Case 2
When [math]\, \Sigma_k \neq I [/math]
Using the Singular Value Decomposition (SVD) of [math]\, \Sigma_k[/math] we get [math] \, \Sigma_k = U_kS_kV_k^\top[/math]. In particular, [math]\, U_k[/math] is a collection of eigenvectors of [math]\, \Sigma_k\Sigma_k^*[/math], and [math]\, V_k[/math] is a collection of eigenvectors of [math]\,\Sigma_k^*\Sigma_k[/math]. Since [math]\, \Sigma_k[/math] is a symmetric matrix<ref> http://en.wikipedia.org/wiki/Covariance_matrix#Properties </ref>, [math]\, \Sigma_k = \Sigma_k^*[/math], so we have [math] \, \Sigma_k = U_kS_kU_k^\top [/math].
For [math]\,\delta_k[/math], the second term becomes what is also known as the Mahalanobis distance <ref>P. C. Mahalanobis, "On The Generalised Distance in Statistics," Proceedings of the National Institute of Sciences of India, 1936</ref> :
- [math]\begin{align} (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\ & = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\ & = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\ & = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\ & = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\ \end{align} [/math]
If we think of [math] \, S_k^{-\frac{1}{2}}U_k^\top [/math] as a linear transformation that takes points in class [math]\,k[/math] and distributes them spherically around a point, like in case 1. Thus when we are given a new point, we can apply the modified [math]\,\delta_k[/math] values to calculate [math]\ h^*(\,x)[/math]. After applying the singular value decomposition, [math]\,\Sigma_k^{-1}[/math] is considered to be an identity matrix such that
[math] \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}[(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k)] + log (\pi_k) [/math]
and,
[math]\ \log(|I|)=\log(1)=0 [/math]
For applying the above method with classes that have different covariance matrices (for example the covariance matrices [math]\ \Sigma_0 [/math] and [math]\ \Sigma_1 [/math] for the two class case), each of the covariance matrices has to be decomposed using SVD to find the according transformation. Then, each new data point has to be transformed using each transformation to compare its distance to the mean of each class (for example for the two class case, the new data point would have to be transformed by the class 1 transformation and then compared to [math]\ \mu_0 [/math] and the new data point would also have to be transformed by the class 2 transformation and then compared to [math]\ \mu_1 [/math]).
The difference between Case 1 and Case 2 (i.e. the difference between using the Euclidean and Mahalanobis distance) can be seen in the illustration below.
As can be seen from the illustration above, the Mahalanobis distance takes into account the distribution of the data points, whereas the Euclidean distance would treat the data as though it has a spherical distribution. Thus, the Mahalanobis distance applies for the more general classification in Case 2, whereas the Euclidean distance applies to the special case in Case 1 where the data distribution is assumed to be spherical.
Generally, we can conclude that QDA provides a better classifier for the data then LDA because LDA assumes that the covariance matrix is identical for each class, but QDA does not. QDA still uses Gaussian distribution as a class conditional distribution. In our real life, this distribution can not be happened each time, so we have to use other distribution as a complement.
The Number of Parameters in LDA and QDA
Both LDA and QDA require us to estimate some parameters. Here is a comparison between the number of parameters needed to be estimated for LDA and QDA:
LDA: Since we just need to compare the differences between one given class and remaining [math]\,K-1[/math] classes, totally, there are [math]\,K-1[/math] differences. For each of them, [math]\,a^{T}x+b[/math] requires [math]\,d+1[/math] parameters. Therefore, there are [math]\,(K-1)\times(d+1)[/math] parameters.
QDA: For each of the differences, [math]\,x^{T}ax + b^{T}x + c[/math] requires [math]\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1[/math] parameters. Therefore, there are [math](K-1)(\frac{d(d+3)}{2}+1)[/math] parameters.
Trick: Using LDA to do QDA
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the Kernel trick that will be discussed later in the course.
In this approach the feature vector is augmented with the quadratic terms (i.e. new dimensions are introduced) where the original data will be projected to that dimensions. We then apply LDA on the new higher-dimensional data.
The motivation behind this approach is to take advantage of the fact that fewer parameters have to be calculated in LDA , as explained in previous sections, and therefore have a more robust system in situations where we have fewer data points.
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate [math]\,\mu_1[/math], [math]\,\mu_2[/math] and [math]\,\Sigma[/math]. In QDA we must estimate all of those, plus another [math]\,\Sigma[/math]; the extra [math]\,\frac{d(d-1)}{2}[/math] estimations make QDA less robust with fewer data points.
Theoretically
Suppose we have a quadratic function to estimate: [math]g(\mathbf{x}) = y = \mathbf{x}^T\mathbf{v}\mathbf{x} + \mathbf{w}^T\mathbf{x}[/math].
Using this trick, we introduce two new vectors, [math]\,\hat{\mathbf{w}}[/math] and [math]\,\hat{\mathbf{x}}[/math] such that:
[math]\hat{\mathbf{w}} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]^T[/math]
and
[math]\hat{\mathbf{x}} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]^T[/math]
We can then apply LDA to estimate the new function: [math]\hat{g}(\mathbf{x},\mathbf{x}^2) = \hat{y} =\hat{\mathbf{w}}^T\hat{\mathbf{x}}[/math].
Note that we can do this for any [math]\, x[/math] and in any dimension; we could extend a [math]D \times n[/math] matrix to a quadratic dimension by appending another [math]D \times n[/math] matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a [math]\,sin(x)[/math] dimension. Pay attention, We don't do QDA with LDA. If we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA.
Principal Component Analysis (PCA) (Lecture: Sep. 27, 2011)
Principal Component Analysis (PCA) is a method of dimensionality reduction/feature extraction that transforms the data from a D dimensional space into a new coordinate system of dimension d, where d <= D ( the worst case would be to have d=D). The goal is to preserve as much of the variance in the original data as possible when switching the coordinate systems. Give data on D variables, the hope is that the data points will lie mainly in a linear subspace of dimension lower than D. In practice, the data will usually not lie precisely in some lower dimensional subspace.
The new variables that form a new coordinate system are called principal components (PCs). PCs are denoted by [math]\ u_1, u_2, ... , u_D [/math]. The principal components form a basis for the data. Since PCs are orthogonal linear transformations of the original variables there is at most D PCs. Normally, not all of the D PCs are used but rather a subset of d PCs, [math]\ u_1, u_2, ... , u_d [/math], to approximate the space spanned by the original data points [math]\ x_1, x_2, ... , x_D [/math]. We can choose d based on what percentage of the original data we would like to maintain.
Let [math]\ PC_j[/math] be a linear combination of [math]\ x_1, x_2, ... , x_D [/math] defined by the coefficients [math]\ w^{(j)}[/math] = [math] ( {w_1}^{(j)}, {w_2}^{(j)},...,{w_D}^{(j)} )^T [/math]
Thus, [math] u_j = {w_1}^{(j)} x_1 + {w_2}^{(j)} x_2 + ... + {w_D}^{(j)} x_D = w^{(j)^T} X [/math]
This is a unique configuration since it sets up the PCs in order from maximum to minimum variances. The first PC, [math]\ u_1 [/math] is called first principal component and has the maximum variance, thus it accounts for the most significant variance in the data [math]\ x_1, x_2, ... , x_D [/math]. The second PC, [math]\ u_2 [/math] is called second principal component and has the second highest variance and so on until PC, [math]\ u_D [/math] which has the minimum variance. The first principal component usually stands for the most interesting feature that we'd like to concern about.
To get the first principal component, we would like to use the following equation:
[math]\ max (Var(w^T X)) = max (w^T S w) [/math]
Where [math]\ S [/math] is the covariance matrix. And we solve for [math]\ w [/math].
Note: we require the constraint [math]\ w^T w = 1 [/math] because if there is no constraint on the length of [math]\ w [/math] then there is no upper bound. With the constraint, the direction and not the length that maximizes the variance can be found.
Lagrange Multiplier
Before we proceed, we should review Lagrange multipliers.
Lagrange multipliers are used to find the maximum or minimum of a function [math]\displaystyle f(x,y)[/math] subject to constraint [math]\displaystyle g(x,y)=0[/math]
we define a new constant [math] \lambda[/math] called a Lagrange Multiplier and we form the Lagrangian,
[math]\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)[/math]
If [math]\displaystyle f(x^*,y^*)[/math] is the max of [math]\displaystyle f(x,y)[/math], there exists [math]\displaystyle \lambda^*[/math] such that [math]\displaystyle (x^*,y^*,\lambda^*) [/math] is a stationary point of [math]\displaystyle L[/math] (partial derivatives are 0).
In addition [math]\displaystyle (x^*,y^*)[/math] is a point in which functions [math]\displaystyle f[/math] and [math]\displaystyle g[/math] touch but do not cross. At this point, the tangents of [math]\displaystyle f[/math] and [math]\displaystyle g[/math] are parallel or gradients of [math]\displaystyle f[/math] and [math]\displaystyle g[/math] are parallel, such that:
[math]\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g[/math]
where,
[math]\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow[/math] the gradient of [math]\, f[/math]
[math]\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow[/math] the gradient of [math]\, g [/math]
Example :
Suppose we want to maximize the function [math]\displaystyle f(x,y)=x-y[/math] subject to the constraint [math]\displaystyle x^{2}+y^{2}=1[/math]. We can apply the Lagrange multiplier method to find the maximum value for the function [math]\displaystyle f [/math]; the Lagrangian is:
[math]\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)[/math]
We want the partial derivatives equal to zero:
[math]\displaystyle \frac{\partial L}{\partial x}=1+2 \lambda x=0 [/math]
[math]\displaystyle \frac{\partial L}{\partial y}=-1+2\lambda y=0[/math]
[math]\displaystyle \frac{\partial L}{\partial \lambda}=x^2+y^2-1[/math]
Solving the system we obtain two stationary points: [math]\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)[/math] and [math]\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)[/math]. In order to understand which one is the maximum, we just need to substitute it in [math]\displaystyle f(x,y)[/math] and see which one as the biggest value. In this case the maximum is [math]\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)[/math].
Determining w :
Use the Lagrange multiplier conversion to obtain: [math]\displaystyle L(w, \lambda) = w^T Sw - \lambda (w^T w - 1)[/math] where [math]\displaystyle \lambda [/math] is a constant
Take the derivative and set it to zero: [math]\displaystyle{\partial L \over{\partial w}} = 0 [/math]
To obtain:
[math]\displaystyle 2Sw - 2 \lambda w = 0[/math]
Rearrange to obtain:
[math]\displaystyle Sw = \lambda w[/math]
where [math]\displaystyle w[/math] is eigenvector of [math]\displaystyle S [/math] and [math]\ \lambda [/math] is the eigenvalue of [math]\displaystyle S [/math] as [math]\displaystyle Sw= \lambda w [/math] , and [math]\displaystyle w^T w=1[/math] , then we can write
[math]\displaystyle w^T Sw= w^T\lambda w= \lambda w^T w =\lambda [/math]
Note that the PCs decompose the total variance in the data in the following way :
[math] \sum_{i=1}^{D} Var(u_i) [/math]
[math]= \sum_{i=1}^{D} (\lambda_i) [/math]
[math]\ = Tr(S) [/math] ---- (S is a co-variance matrix, and therefore it's symmetric)
[math]= \sum_{i=1}^{D} Var(x_i)[/math]
Principal Component Analysis (PCA) Continued (Lecture: Sep. 29, 2011)
As can be seen from the above expressions, [math]\ Var(W^\top X) = W^\top S W= \lambda [/math] where lambda is an eigenvalue of the sample covariance matrix [math]\ S [/math] and [math]\ W[/math] is its corresponding eigenvector. So [math]\ Var(u_i) [/math] is maximized if [math]\ \lambda_i [/math] is the maximum eigenvalue of [math]\ S [/math] and the first principal component (PC) is the corresponding eigenvector. Each successive PC can be generated in the above manner by taking the eigenvectors of [math]\ S[/math]<ref>www.wikipedia.org/wiki/Eigenvalues_and_eigenvectors</ref> that correspond to the eigenvalues:
[math]\ \lambda_1 \geq ... \geq \lambda_D [/math]
such that
[math]\ Var(u_1) \geq ... \geq Var(u_D) [/math]
Alternative Derivation
Another way of looking at PCA is to consider PCA as a projection from a higher D-dimension space to a lower d-dimensional subspace that minimizes the squared reconstruction error. The squared reconstruction error is the difference between the original data set [math]\ X [/math] and the new data set [math] \hat{X} [/math] obtained by first projecting the original data set into a lower d-dimensional subspace and then projecting it back into the the original higher D-dimension space. Since information is (normally) lost by compressing the the original data into a lower d-dimensional subspace, the new data set will (normally) differ from the original data even though both are part of the higher D-dimension space. The reconstruction error is computed as shown below.
Reconstruction Error
[math] e = \sum_{i=1}^{n} || x_i - \hat{x}_i ||^2 [/math]
Minimize Reconstruction Error
Suppose [math] \bar{x} = 0 [/math] where [math] \hat{x}_i = x_i - \bar{x} [/math]
Let [math]\ f(y) = U_d y [/math] where [math]\ U_d [/math] is a D by d matrix with d orthogonal unit vectors as columns.
Fit the model to the data and minimize the reconstruction error:
[math]\ min_{U_d, y_i} \sum_{i=1}^n || x_i - U_d y_i ||^2 [/math]
Differentiate with respect to [math]\ y_i [/math]:
[math] \frac{\partial e}{\partial y_i} = 0 [/math]
we can rewrite reconstruction-error as : [math]\ e = \sum_{i=1}^n(x_i - U_d y_i)^T(x_i - U_d y_i) [/math]
[math]\ \frac{\partial e}{\partial y_i} = 2(-U_d)(x_i - U_d y_i) = 0 [/math]
since [math]\ U_d(x_i - U_d y_i) [/math] is a linear combination of the columns of [math]\ U_d [/math],
which are independent (orthogonal to each other) we can conclude that:
[math]\ x_i - U_d y_i = 0 [/math] or equivalently,
[math]\ x_i = U_d y_i [/math]
[math]\ y_i = U_d^T x_i [/math]
Find the orthogonal matrix [math]\ U_d [/math]:
[math]\ min_{U_d} \sum_{i=1}^n || x_i - U_d U_d^T x_i||^2 [/math]
Using SVD
A unique solution can be obtained by finding the Singular Value Decomposition (SVD) of [math]\ X [/math]:
[math]\ X = U S V^T [/math]
For each rank d, [math]\ U_d [/math] consists of the first d columns of [math]\ U [/math]. Also, the covariance matrix can be expressed as follows [math]\ S = \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T [/math].
Simply put, by subtracting the mean of each of the data point features and then applying SVD, one can find the principal components:
[math] \tilde{X} = X - \mu [/math]
[math]\ \tilde{X} = U S V^T [/math]
Where [math]\ X [/math] is a d by n matrix of data points and the features of each data point form a column in [math]\ X [/math]. Also, [math]\ \mu [/math] is a d by n matrix with identical columns each equal to the mean of the [math]\ x_i[/math]'s, ie [math]\mu_{:,j}=\frac{1}{n}\sum_{i=1}^n x_i [/math]. Note that the arrangement of data points is a convention and indeed in Matlab or conventional statistics, the transpose of the matrices in the above formulae is used.
As the [math]\ S [/math] matrix from the SVD has the eigenvalues arranged from largest to smallest, the corresponding eigenvectors in the [math]\ U [/math] matrix from the SVD will be such that the first column of [math]\ U [/math] is the first principal component and the second column is the second principal component and so on.
Examples
Note that in the Matlab code in the examples below, the mean was not subtracted from the datapoints before performing SVD. This is what was shown in class. However, to properly perform PCA, the mean should be subtracted from the datapoints.
Example 1
Consider a matrix of data points [math]\ X [/math] with the dimensions 560 by 1965. 560 is the number of elements in each column. Each column is a vector representation of a 20x28 grayscale pixel image of a face (see image below) and there is a total of 1965 different images of faces. Each of the images are corrupted by noise, but the noise can be removed by projecting the data back to the original space taking as many dimensions as one likes (e.g, 2, 3 4 0r 5). The corresponding Matlab commands are shown below:
>> % start with a 560 by 1965 matrix X that contains the data points >> load(noisy.mat); >> >> % set the colors to grayscale >> colormap gray >> >> % show image in column 10 by reshaping column 10 into a 20 by 28 matrix >> imagesc(reshape(X(:,10),20,28)') >> >> % perform SVD, if X matrix if full rank, will obtain 560 PCs >> [S U V] = svd(X); >> >> % reconstruct X ( project X onto the original space) using only the first ten principal components >> Y_pca = U(:, 1:10)'*X; >> >> % show image in column 10 of X_hat which is now a 560 by 1965 matrix >> imagesc(reshape(X_hat(:,10),20,28)')
The reason why the noise is removed in the reconstructed image is because the noise does not create a major variation in a single direction in the original data. Hence, the first ten PCs taken from [math]\ U [/math] matrix are not in the direction of the noise. Thus, reconstructing the image using the first ten PCs, will remove the noise.
Example 2
Consider a matrix of data points [math]\ X [/math] with the dimensions 64 by 400. 64 is the number of elements in each column. Each column is a vector representation of a 8x8 grayscale pixel image of either a handwritten number 2 or a handwritten number 3 (see image below) and there are a total of 400 different images, where the first 200 images show a handwritten number 2 and the last 200 images show a handwritten number 3.
The corresponding Matlab commands for performing PCA on the data points are shown below:
>> % start with a 64 by 400 matrix X that contains the data points >> load 2_3.mat; >> >> % set the colors to grayscale >> colormap gray >> >> % show image in column 2 by reshaping column 2 into a 8 by 8 matrix >> imagesc(reshape(X(:,2),8,8)) >> >> % perform SVD, if X matrix if full rank, will obtain 64 PCs >> [U S V] = svd(X); >> >> % project data down onto the first two PCs >> Y = U(:,1:2)'*X; >> >> % show Y as an image (can see the change in the first PC at column 200, >> % when the handwritten number changes from 2 to 3) >> imagesc(Y) >> >> % perform PCA using Matlab build-in function (do not use for assignment) >> % also note that due to the Matlab convention, the transpose of X is used >> [COEFF, Y] = princomp(X'); >> >> % again, use the first two PCs >> Y = Y(:,1:2); >> >> % use plot digits to show the distribution of images on the first two PCs >> images = reshape(X, 8, 8, 400); >> plotdigits(images, Y, .1, 1);
Using the plotdigits function in Matlab, clearly illustrates that the first PC captured the differences between the numbers 2 and 3 as they are projected onto different regions of the axis for the first PC. Also, the second PC captured the tilt of the handwritten numbers as numbers tilted to the left or right were projected onto different regions of the axis for the second PC.
Example 3
(Not discussed in class) In the news recently was a story that captures some of the ideas behind PCA. Over the past two years, Scott Golder and Michael Macy, researchers from Cornell University, collected 509 million Twitter messages from 2.4 million users in 84 different countries. The data they used were words collected at various times of day and they classified the data into two different categories: positive emotion words and negative emotion words. Then, they were able to study this new data to evaluate subjects' moods at different times of day, while the subjects were in different parts of the world. They found that the subjects generally exhibited positive emotions in the mornings and late evenings, and negative emotions mid-day. They were able to "project their data onto a smaller dimensional space" using PCS. Their paper, "Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures," is available in the journal Science.<ref>http://www.pcworld.com/article/240831/twitter_analysis_reveals_global_human_moodiness.html</ref>.
Assumptions Underlying Principal Component Analysis can be found here<ref>http://support.sas.com/publishing/pubcat/chaps/55129.pdf</ref>
Example 4
(Not discussed in class) A somewhat well known learning rule in the field of neural networks called Oja's rule can be used to train networks of neurons to compute the principal component directions of data sets. <ref>A Simplified Neuron Model as a Principal Component Analyzer. Erkki Oja. 1982. Journal of Mathematical Biology. 15: 267-273</ref> This rule is formulated as follows
[math]\,\Delta w = \eta yx -\eta y^2w [/math]
where [math]\,\Delta w [/math] is the neuron weight change, [math]\,\eta[/math] is the learning rate, [math]\,y[/math] is the neuron output given the current input, [math]\,x[/math] is the current input and [math]\,w[/math] is the current neuron weight. This learning rule shares some similarities with another method for calculating principal components: power iteration. The basic algorithm for power iteration (taken from wikipedia: <ref>Wikipedia. http://en.wikipedia.org/wiki/Principal_component_analysis#Computing_principal_components_iteratively</ref>) is shown below
[math]\mathbf{p} =[/math] a random vector do c times: [math]\mathbf{t} = 0[/math] (a vector of length m) for each row [math]\mathbf{x} \in \mathbf{X^T}[/math] [math]\mathbf{t} = \mathbf{t} + (\mathbf{x} \cdot \mathbf{p})\mathbf{x}[/math] [math]\mathbf{p} = \frac{\mathbf{t}}{|\mathbf{t}|}[/math] return [math]\mathbf{p}[/math]
Comparing this with the neuron learning rule we can see that the term [math]\, \eta y x [/math] is very similar to the [math]\,\mathbf{t}[/math] update equation in the power iteration method, and identical if the neuron model is assumed to be linear ([math]\,y(x)=x\mathbf{p}[/math]) and the learning rate is set to 1. Additionally, the [math]\, -\eta y^2w [/math] term performs the normalization, the same function as the [math]\,\mathbf{p}[/math] update equation in the power iteration method.
Observations
Some observations about the PCA were brought up in class:
- PCA assumes that data is on a linear subspace or close to a linear subspace. For non-linear dimensionality reduction, other techniques are used. Amongst the first proposed techniques for non-linear dimensionality reduction are Locally Linear Embedding (LLE) and Isomap. More recent techniques include Maximum Variance Unfolding (MVU) and t-Distributed Stochastic Neighbor Embedding (t-SNE). Kernel PCAs may also be used, but they depend on the type of kernel used and generally do not work well in practice. (Kernels will be covered in more detail later in the course.)
- Finding the number of PCs to use is not straightforward. It requires knowledge about the instrinsic dimentionality of data. In practice, oftentimes a heuristic approach is adopted by looking at the eigenvalues ordered from largest to smallest. If there is a "dip" in the magnitude of the eigenvalues, the "dip" is used as a cut off point and only the large eigenvalues before the "dip" are used. Otherwise, it is possible to add up the eigenvalues from largest to smallest until a certain percentage value is reached. This percentage value represents the percentage of variance that is preserved when projecting onto the PCs corresponding to the eigenvalues that have been added together to achieve the percentage.
- It is a good idea to normalize the variance of the data before applying PCA. This will avoid PCA finding PCs in certain directions due to the scaling of the data, rather than the real variance of the data.
- PCA can be considered as an unsupervised approach, since the main direction of variation is not known beforehand, i.e. it is not completely certain which dimension the first PC will capture. The PCs found may not correspond to the desired labels for the data set. There are, however, alternate methods for performing supervised dimensionality reduction.
- (Not in class) Even though the traditional PCA method does not work well on data set that lies on a non-linear manifold. A revised PCA method, called c-PCA, has been introduced to improve the stability and convergence of intrinsic dimension estimation. The approach first finds a minimal cover (a cover of a set X is a collection of sets whose union contains X as a subset<ref>http://en.wikipedia.org/wiki/Cover_(topology)</ref>) of the data set. Since set covering is an NP-hard problem, the approach only finds an approximation of minimal cover to reduce the complexity of the run time. In each subset of the minimal cover, it applies PCA and filters out the noise in the data. Finally the global intrinsic dimension can be determined from the variance results from all the subsets. The algorithm produces robust results.<ref>Mingyu Fan, Nannan Gu, Hong Qiao, Bo Zhang, Intrinsic dimension estimation of data by principal component analysis, 2010. Available: http://arxiv.org/abs/1002.2050</ref>
- (Not in class) While PCA finds the mathematically optimal method (as in minimizing the squared error), it is sensitive to outliers in the data that produce large errors PCA tries to avoid. It therefore is common practice to remove outliers before computing PCA. However, in some contexts, outliers can be difficult to identify. For example in data mining algorithms like correlation clustering, the assignment of points to clusters and outliers is not known beforehand. A recently proposed generalization of PCA based on a Weighted PCA increases robustness by assigning different weights to data objects based on their estimated relevancy.<ref>http://en.wikipedia.org/wiki/Principal_component_analysis</ref>
- (Not in class) Comparison between PCA and LDA: Principal Component Analysis (PCA)and Linear Discriminant Analysis (LDA) are two commonly used techniques for data classification and dimensionality reduction. Linear Discriminant Analysis easily handles the case where the within-class frequencies are unequal and their performances has been examined on randomly generated test data. This method maximizes the ratio of between-class variance to the within-class variance in any particular data set thereby guaranteeing maximal separability. ... The prime difference between LDA and PCA is that PCA does more of feature classification and LDA does data classification. In PCA, the shape and location of the original data sets changes when transformed to a different space whereas LDA doesn’t change the location but only tries to provide more class separability and draw a decision region between the given classes. This method also helps to better understand the distribution of the feature data." <ref> Balakrishnama, S., Ganapathiraju, A. LINEAR DISCRIMINANT ANALYSIS - A BRIEF TUTORIAL. http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf </ref>
Summary
The PCA algorithm can be summarized into the following steps:
- Recover basis
- [math]\ \text{ Calculate } XX^T=\Sigma_{i=1}^{t}x_ix_{i}^{T} \text{ and let } U=\text{ eigenvectors of } XX^T \text{ corresponding to the largest } d \text{ eigenvalues.} [/math]
- Encode training data
- [math]\ \text{Let } Y=U^TX \text{, where } Y \text{ is a } d \times t \text{ matrix of encodings of the original data.} [/math]
- Reconstruct training data
- [math] \hat{X}=UY=UU^TX [/math].
- Encode test example
- [math]\ y = U^Tx \text{ where } y \text{ is a } d\text{-dimensional encoding of } x [/math].
- Reconstruct test example
- [math] \hat{x}=Uy=UU^Tx [/math].
Fisher Discriminant Analysis (FDA) (Lecture: Sep. 29, 2011)
Fisher Discriminant Analysis (FDA) is sometimes called Fisher Linear Discriminant Analysis (FLDA) or just Linear Discriminant Analysis (LDA). This causes confusion with the Linear Discriminant Analysis (LDA) technique covered earlier in the course. The LDA technique covered earlier in the course has a normality assumption and is a boundary finding technique. The FDA technique outlined here is a supervised feature extraction technique. FDA differs from PCA as well because PCA does not use the class labels, [math]\ y_i[/math], of the data [math]\ (x_i,y_i)[/math] while FDA organizes data into their classes by finding the direction of maximum separation between classes.
Fisher Discriminant Analysis (FDA) Continued (Lecture: Oct. 04, 2011)
One main drawback of the PCA technique is that the direction of greatest variation may not produce the classification we desire. For example, imagine if the data set above had a lightening filter applied to a random subset of the images. Then the greatest variation would be the brightness and not the more important variations we wish to classify. As another example , if we imagine 2 cigar like clusters in 2 dimensions, one cigar has [math]y = 1[/math] and the other [math]y = -1[/math]. The cigars are positioned in parallel and very closely together, such that the variance in the total data-set, ignoring the labels, is in the direction of the cigars. For classification, this would be a terrible projection, because all labels get evenly mixed and we destroy the useful information. A much more useful projection is orthogonal to the cigars, i.e. in the direction of least overall variance, which would perfectly separate the data-cases (obviously, we would still need to perform classification in this 1-D space.) See figure below <ref>www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf</ref>. FDA circumvents this problem by using the labels, [math]\ y_i[/math], of the data [math]\ (x_i,y_i)[/math] i.e. the FDA uses supervised learning. The main difference between FDA and PCA is that, in PCA we are interested in transforming the data to a new coordinate system such that the greatest variance of data lies on the first coordinate, but in FDA, we project the data of each class onto a point in such a way that the resulting points would be as far apart from each other as possible. The FDA goal is achieved by projecting data onto a suitably chosen line that minimizes the within class variance, and maximizes the distance between the two classes i.e. group similar data together and spread different data apart. This way, new data acquired can be compared, after a transformation, to where these projections, using some well-chosen metric.
We first consider the cases of two-classes. Denote the mean and covariance matrix of class [math]i=0,1[/math] by [math]\mathbf{\mu}_i[/math] and [math]\mathbf{\Sigma}_i[/math] respectively. We transform the data so that it is projected into 1 dimension i.e. a scalar value. To do this, we compute the inner product of our [math]dx1[/math]-dimensional data, [math]\mathbf{x}[/math], by a to-be-determined [math]dx1[/math]-dimensional vector [math]\mathbf{w}[/math]. The new means and covariances of the transformed data:
- [math] \mu'_i:\rightarrow \mathbf{w}^{T}\mathbf{\mu}_i [/math]
- [math] \Sigma'_i :\rightarrow \mathbf{w}^{T}\mathbf{\sigma}_i \mathbf{w}[/math]
- [math] \mu'_i:\rightarrow \mathbf{w}^{T}\mathbf{\mu}_i [/math]
The new means and variances are actually scalar values now, but we will use vector and matrix notation and arguments throughout the following derivation as the multi-class case is then just a simpler extension.
Goals of FDA
As will be shown in the objective function, the goal of FDA is to maximize the separation of the classes (between class variance) and minimize the scatter within each class (within class variance). That is, our ideal situation is that the individual classes are as far away from each other as possible and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case). An interesting note is that R. A. Fisher who FDA is named after, used the FDA technique for purposes of taxonomy, in particular for categorizing different species of iris flowers. <ref name="RAFisher">R. A. Fisher, "The Use of Multiple measurements in Taxonomic Problems," Annals of Eugenics, 1936</ref>. It is very easy to visualize what is meant by within class variance (i.e. differences between the iris flowers of the same species) and between class variance (i.e. the differences between the iris flowers of different species) in that case.
First, we need to reduce the dimensionality of covariate to one dimension (two-class case) by projecting the data onto a line. That is take the d-dimensional input values x and project it to one dimension by using [math]\mathbf{z}=\mathbf{w}^T \mathbf{x}[/math] where [math]\mathbf{w}^T [/math] is 1 by d and [math]\mathbf{x}[/math] is d by 1.
Goal: choose the vector [math]\mathbf{w}=(w_1,w_2,w_3,...,w_z)^T [/math] that best seperate the data, then we perform classification with projected data [math]\mathbf{z}[/math] instead of original data [math]\mathbf{x}[/math] .
[math]\hat{{\mu}_0}=\frac{1}{n_0}\sum_{i:y_i=0} x_i[/math]
[math]\hat{{\mu}_1}=\frac{1}{n_1}\sum_{i:y_i=1} x_i[/math]
[math]\mathbf{x}\rightarrow\mathbf{w}^{T}\mathbf{x}[/math].
[math]\mathbf{\mu}\rightarrow\mathbf{w}^{T}\mathbf{\mu}[/math].
[math]\mathbf{\Sigma}\rightarrow\mathbf{w}^{T}\mathbf{\Sigma}\mathbf{w}[/math]
1) Our first goal is to minimize the individual classes' covariance. This will help to collapse the data together.
We have two minimization problems
- [math]\min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{\Sigma}_0 \mathbf{w}[/math]
and
- [math]\min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{\Sigma}_1 \mathbf{w}[/math].
But these can be combined:
- [math] \min_{\mathbf{w}} \mathbf{w} ^{T}\mathbf{\Sigma}_0 \mathbf{w} + \mathbf{w}^{T} \mathbf{\Sigma}_1 \mathbf{w}[/math]
- [math] = \min_{\mathbf{w}} \mathbf{w} ^{T}( \mathbf{\Sigma_0} + \mathbf{\Sigma_1} ) \mathbf{w}[/math]
Define [math] \mathbf{S}_W =\mathbf{\Sigma_0} + \mathbf{\Sigma_1} [/math], called the within class variance matrix.
2) Our second goal is to move the minimized classes as far away from each other as possible. One way to accomplish this is to maximize the distances between the means of the transformed data i.e.
[math] \max_{\mathbf{w}} |\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1|^2 [/math]
Simplifying:
- [math] \max_{\mathbf{w}} \,(\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1)^T (\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1) [/math]
- [math] = \max_{\mathbf{w}}\, (\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w}^{T} \mathbf{w} (\mathbf{\mu}_0-\mathbf{\mu}_1)[/math]
- [math] = \max_{\mathbf{w}} \,\mathbf{w}^{T}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w}[/math]
- [math] \max_{\mathbf{w}} \,(\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1)^T (\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1) [/math]
Recall that [math] \mathbf{\mu}_i [/math] are known. Denote
- [math] \mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}[/math]
This matrix, called the between class variance matrix, is a rank 1 matrix, so an inverse does not exist. Altogether, we have two optimization problems we must solve simultaneously:
- 1) [math] \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} [/math]
- 2) [math] \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} [/math]
- 1) [math] \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} [/math]
There are other metrics one can use to both minimize the data's variance and maximizes the distance between classes, and other goals we can try to accomplish (see metric learning, below...one day), but Fisher used this elegant method, hence his recognition in the name, and we will follow his method.
We can combine the two optimization problems into one after noting that the negative of max is min:
- [math] \max_{\mathbf{w}} \; \alpha \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} [/math]
- [math] \max_{\mathbf{w}} \; \alpha \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} [/math]
The [math]\alpha[/math] coefficient is a necessary scaling factor: if the scale of one of the terms is much larger than the other, the optimization problem will be dominated by the larger term. This means we have another unknown, [math]\alpha[/math], to solve for. Instead, we can circumvent the scaling problem by looking at the ratio of the quantities, the original solution Fisher proposed:
- [math] \max_{\mathbf{w}} \frac{\mathbf{w} \mathbf{S_B} \mathbf{w}^{T}}{\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}} [/math]
This optimization problem can be shown<ref> http://www.socher.org/uploads/Main/optimizationTutorial01.pdf </ref> to be equivalent to the following optimization problem:
- [math] \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}[/math]
- [math] \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}[/math]
(optimized function)
subject to:
- [math] \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 [/math]
- [math] \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 [/math]
(constraint)
A heuristic understanding of this equivalence is that we have two degrees of freedom: direction and scalar. The scalar value is irrelevant to our discussion. Thus, we can set one of the values to be a constant. We can use Lagrange multipliers to solve this optimization problem:
- [math]L( \mathbf{w}, \lambda) = \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \lambda(\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}-1)[/math]
- [math] \Rightarrow \frac{\partial L}{\partial \mathbf{w}} = 2 \mathbf{S}_B \mathbf{w} - 2\lambda \mathbf{S}_W\mathbf{w} [/math]
Setting the partial derivative to 0 gives us a generalized eigenvalue problem:
- [math] \mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} [/math]
- [math] \Rightarrow \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w} [/math]
This is a generalized eigenvalue problem and [math]\ \mathbf{w} [/math] can be computed as the eigenvector corresponds to the largest eigenvalue of
- [math] \mathbf{S}_W^{-1} \mathbf{S}_B [/math]
It is very likely that [math] \mathbf{S}_W [/math] has an inverse. If not, the pseudo-inverse<ref> http://en.wikipedia.org/wiki/Generalized_inverse </ref><ref> http://www.mathworks.com/help/techdoc/ref/pinv.html </ref> can be used. In Matlab the pseudo-inverse function is named pinv. Thus, we should choose [math]\mathbf{w}[/math] to equal the eigenvector of the largest eigenvalue as our projection vector.
In fact we can simplify the above expression further in the case of two classes. Recall the definition of [math]\mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}[/math]. Substituting this into our expression:
- [math] \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w} = \lambda \mathbf{w} [/math]
- [math] (\mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) ) ((\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w}) = \lambda \mathbf{w} [/math]
This second term is a scalar value, let's denote it [math]\beta[/math]. Then
- [math] \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) = \frac{\lambda}{\beta} \mathbf{w} [/math]
- [math] \Rightarrow \, \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) \propto \mathbf{w} [/math]
(this equation indicates the direction of the separation).
All we are interested in the direction of [math]\mathbf{w}[/math], so to compute this is sufficient to finding our projection vector. Though this will not work in higher dimensions, as [math]\mathbf{w}[/math] would be a matrix and not a vector in higher dimensions.
Extensions to Multiclass Case
If we have [math]\ k[/math] classes, we need [math]\ k-1[/math] directions i.e. we need to project [math]\ k[/math] 'points' onto a [math]\ k-1[/math] dimensional hyperplane. What does this change in our above derivation? The most significant difference is that our projection vector,[math]\mathbf{w}[/math], is no longer a vector but instead is a matrix [math]\mathbf{W}[/math], where [math]\mathbf{W}[/math] is a d*(k-1) matrix if X is in d-dim. We transform the data as:
- [math] \mathbf{x}' :\rightarrow \mathbf{W}^{T} \mathbf{x}[/math]
so our new mean and covariances for class k are:
- [math] \mathbf{\mu_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\mu_k}[/math]
- [math] \mathbf{\Sigma_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W}[/math]
What are our new optimization sub-problems? As before, we wish to minimize the within class variance. This can be formulated as:
- [math]\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{\Sigma_1} \mathbf{W} + \dots + \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W} [/math]
Again, denoting [math]\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}[/math], we can simplify above expression:
- [math]\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_W \mathbf{W} [/math]
Similarly, the second optimization problem is:
- [math]\max_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_B \mathbf{W} [/math]
What is [math]\mathbf{S}_B[/math] in this case? It can be shown that [math]\mathbf{S}_T = \mathbf{S}_B + \mathbf{S}_W [/math] where [math] \mathbf{S}_T [/math] is the covariance matrix of all the data. From this we can compute [math] \mathbf{S}_B [/math].
Next, if we express [math] \mathbf{W} = ( \mathbf{w}_1 , \mathbf{w}_2 , \dots ,\mathbf{w}_k ) [/math] observe that, for [math] \mathbf{A} = \mathbf{S}_B , \mathbf{S}_W [/math]:
- [math] Tr(\mathbf{W}^{T} \mathbf{A} \mathbf{W}) = \mathbf{w}_1^{T} \mathbf{A} \mathbf{w}_1^{T} + \dots + \mathbf{w}_k \mathbf{A} \mathbf{w}_k [/math]
where [math]\ Tr()[/math] is the trace of a matrix. Thus, following the same steps as in the two-class case, we have the new optimization problem:
- [math] \max_{\mathbf{W}} \frac{ Tr(\mathbf{W}^{T} \mathbf{S}_B \mathbf{W}) }{Tr(\mathbf{W}^{T} \mathbf{S}_W \mathbf{W})} [/math]
subject to:
- [math] Tr( \mathbf{W} \mathbf{S_W} \mathbf{W}^{T}) = 1 [/math]
Again, in order to solve the above optimization problem, we can use the Lagrange multiplier <ref> http://en.wikipedia.org/wiki/Lagrange_multiplier </ref>:
- [math]\begin{align}L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - I \right\}\end{align}[/math].
where [math]\ \Lambda[/math] is a d by d diagonal matrix.
Then, we differentiating with respect to [math]\mathbf{W}[/math]:
- [math]\begin{align}\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\end{align} = 0[/math].
Thus:
- [math]\begin{align}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}\end{align}[/math]
- [math]\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{W}\end{align}[/math]
where, [math] \mathbf{\Lambda} =\begin{pmatrix}\lambda_{1} & & 0\\&\ddots&\\0 & &\lambda_{d}\end{pmatrix}[/math]
The above equation is of the form of an eigenvalue problem. Thus, for the solution the k-1 eigenvectors corresponding to the k-1 largest eigenvalues should be chosen as the projection matrix, [math]\mathbf{W}[/math]. In fact, there should only by k-1 eigenvectors corresponding to k-1 non-zero eigenvalues using the above equation.
Summary
FDA has two optimization problems:
- 1) [math] \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} [/math]
- 2) [math] \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} [/math]
- 1) [math] \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} [/math]
where [math]\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}[/math] is called the within class variance and [math]\ \mathbf{S}_B = \mathbf{S}_T - \mathbf{S}_W [/math] is called the between class variance where [math]\mathbf{S}_T [/math] is the variance of all the data together.
The two optimization problems are combined as follows:
- [math] \max_{\mathbf{w}} \frac{\mathbf{w} \mathbf{S_B} \mathbf{w}^{T}}{\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}} [/math]
By adding a constraint as shown:
- [math] \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}[/math]
subject to:
- [math] \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 [/math]
Lagrange multipliers can be used and essentially the problem becomes an eigenvalue problem:
- [math]\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w} = \lambda\mathbf{w}\end{align}[/math]
And [math]\ w [/math] can be computed as the k-1 eigenvectors corresponding to the largest k-1 eigenvalues of [math] \mathbf{S}_W^{-1} \mathbf{S}_B [/math].
Variations
Some adaptations and extensions exist for the FDA technique (Source: <ref>R. Gutierrez-Osuna, "Linear Discriminant Analysis" class notes for Intro to Pattern Analysis, Texas A&M University. Available: [1]</ref>):
1) Non-Parametric LDA (NPLDA) by Fukunaga
This method does not assume that the Gaussian distribution is unimodal and it is actually possible to extract more than k-1 features (where k is the number of classes).
2) Orthonormal LDA (OLDA) by Okada and Tomita
This method finds projections that are orthonormal in addition to maximizing the FDA objective function. This method can also extract more than k-1 features (where k is the number of classes).
3) Generalized LDA (GLDA) by Lowe
This method incorporates additional cost functions into the FDA objective function. This causes classes with a higher cost to be placed further apart in the lower dimensional representation.
Optical Character Recognition (OCR) using FDA
Optical Character Recognition (OCR) is a method to translate scanned, human-readable text into machine-encoded text. In class, we have employed FDA to recognize digits. A paper <ref>Manjunath Aradhya, V.N., Kumar, G.H., Noushath, S., Shivakumara, P., "Fisher Linear Discriminant Analysis based Technique Useful for Efficient Character Recognition", Intelligent Sensing and Information Processing, 2006.</ref> describes the use of FDA to recognize printed documents written in English and Kannada, the fifth most popular language in India. The researchers conducted two types of experiments: one on printed Kannada and English documents and another on handwritten English characters. In the first type of experiments, they conducted four experiments: i) clear and degraded characters in specific fonts; ii) characters in various size; iii) characters in various fonts; iv) characters with noise. In experiment i, FDA achieved 98.2% recognition rate with 12 projection vectors in 21,560 samples. In experiment ii, it achieved 96.9% recognition rate with 10 projection vectors in 11,200 samples. In experiment iii, it achieved 93% recognition rate with 17 projection vectors in 19,850 samples. In experiment iv, it achieved 96.3% recognition rate with 14 projection vectors in 20,000 samples. Overall, the recognition by FDA was very satisfying. In the second type of experiment, a total of 12,400 handwriting samples from 200 different writers were collected. With 175 samples for training purpose, the recognition rate by FDA is 92% with 35 projection vectors.
Facial Recognition using FDA
The Fisherfaces method of facial recognition uses PCA and FDA in a similar way to using just PCA. However, it is more advantageous than using on PCA because it minimizes variation within each class and maximizes class separation. The PCA only method is, therefore, more sensitive to lighting and pose variations. In studies done by Belhumeir, Hespanda, and Kiregeman (1997) and Turk and Pentland (1991), this method had a 96% recognition rate. <ref>Bagherian, Elham. Rahmat, Rahmita. Facial Feature Extraction for Face Recognition: a Review. International Symposium on Information Technology, 2008. ITSim2 article number 4631649.</ref>
Linear and Logistic Regression (Lecture: Oct. 06, 2011)
Linear Regression
Both Regression and Classification are aimed to find a function h which maps data X to feature Y. In regression, [math]\ y [/math] is a continuous variable. In classification, [math]\ y [/math] is a discrete variable. Regression problems are easier to formulate into functions (since [math]\ y [/math] is continuous) and it is possible to solve classification problems by treating them like regression problems. In order to do so, the requirement in classification that [math]\ y [/math] is discrete must first be relaxed. Once [math]\ y [/math] has been found using regression techniques, it is possible to determine the discrete class corresponding to the [math]\ y [/math] that has been found to solve the original classification problem. The discrete class is obtained by defining a threshold where [math]\ y [/math] values below the threshold belong to one class and [math]\ y [/math] values above the threshold belong to another class.
More formally: a more direct approach to classification is to estimate the regression function [math]\ r(\mathbf{x}) = E[Y | X][/math] without bothering to estimate [math]\ f_k(\mathbf{x}) [/math]. For the linear model, we assume either the regression function [math]r(\mathbf{x})[/math] is linear,or the linear model is a reasonable approximation.
Here is a simple example. If [math]\ Y = \{0,1\}[/math], then [math]\, h^*(\mathbf{x})= \left\{\begin{matrix} 1 &\text{, if } \hat r(\mathbf{x})\gt \frac{1}{2} \\ 0 &\mathrm{, otherwise} \end{matrix}\right.[/math]
Basically, we can use a linear function [math]\ f(x, \beta) = \mathbf{\beta\,}^T \mathbf{x_{i}} + \mathbf{\beta\,_0} [/math] , [math]\mathbf{x_{i}} \in \mathbb{R}^{d}[/math] and use the least squares approach to fit the function to the given data. This is done by minimizing the following expression:
[math]\min_{\mathbf{\beta}} \sum_{i=1}^n (y_i - \mathbf{\beta}^T \mathbf{x_{i}} - \mathbf{\beta_0})^2[/math]
where
[math]\tilde{\mathbf{\beta}} = \left( \begin{array}{c}\mathbf{\beta_{1}} \\ \\ \dots \\ \\ \mathbf{\beta}_{d} \\ \\ \mathbf{\beta}_{0} \end{array} \right) \in \mathbb{R}^{d+1}[/math].
For convenience, [math]\mathbf{\beta}[/math] and [math]\mathbf{\beta}_0[/math] have been combined into a d+1 dimensional vector. And an extra term 1 is appended to [math]\ x [/math]. Thus, the function to be minimized can now be expressed as:
[math]\ \min_{\tilde{\beta}} \sum_{i=1}^{n} (y_i - \tilde{\beta} \tilde{x_i} )^2 [/math]
[math]\ = \min_{\tilde{\beta}} || y - X \tilde{\beta}^T ||^2 [/math]
where [math]\ y [/math] and [math]\tilde{\beta}[/math] are vectors and [math]\ X [/math] is a matrix.
The solution for [math]\ \tilde{\beta} [/math] is
[math]\ {\tilde{\beta}} = (XX^T)^{-1}X^Ty [/math]
Using regression to solve classification problems is not mathematically correct, if we want to be true to classification. However, this method works well in practice, if the problem is not complicated. When we have only two classes (encoded as [math]\ \frac{-n}{n_1} [/math] and [math]\ \frac{n}{n_2}) [/math], this method is identical to LDA.
Matlab Example
The following is the code and the explanation for each step.
Again, we use the data in 2_3.m.
>>load 2_3; >>[U, sample] = princomp(X'); >>sample = sample(:,1:2);
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.
>>y = zeros(400,1); >>y(201:400) = 1;
We let y represent the set of labels coded as 0 and 1.
>>x=[sample;ones(1,400)];
Construct x by adding a row of vector 1 to data.
>>b=inv(x*x')*x*y;
Calculate b, which represents [math]\beta[/math] in the linear regression model.
>>x1=x'; >>for i=1:400 if x1(i,:)*b>0.5 plot(x1(i,1),x1(i,2),'.') hold on elseif x1(i,:)*b < 0.5 plot(x1(i,1),x1(i,2),'r.') end end
Plot the fitted y values.
Logistic Regression
Logistic regression is a more advanced method for classification, and is more commonly used. In statistics, logistic regression (sometimes called the logistic model or logit model) is used for prediction of the probability of occurrence of an event by fitting data to a logit function logistic curve. It is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences fields, as well as marketing applications such as prediction of a customer's propensity to purchase a product or cease a subscription.<ref>http://en.wikipedia.org/wiki/Logistic_regression</ref>
We can define a function
[math]f_1(x)= P(Y=1| X=x) = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})[/math]
This is a valid density function. It looks similar to a step function, but
we have relaxed it so that we have a smooth curve, and can therefore take the
derivative.
The range of this function is (0,1) since
[math]\lim_{x \to -\infty}f_1(\mathbf{x}) = 0[/math] and
[math]\lim_{x \to \infty}f_1(\mathbf{x}) = 1[/math].
As shown on this graph of [math]\ P(Y=1 | X=x) [/math].
Then we compute the complement of f1(x), and get
[math]f_2(x)= P(Y=0| X=x) = 1-f_1(x) = (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})[/math], denoted f2.
Function f2 is commonlly called Logistic function, and it behaves like
[math]\lim_{x \to -\infty}f_2(\mathbf{x}) = 1[/math] and
[math]\lim_{x \to \infty}f_2(\mathbf{x}) = 0[/math].
As shown on this graph of [math]\ P(Y=0 | X=x) [/math].
From here, we can form the conditional density function. To do this, we must combine
[math]f_1[/math] and [math]f_2[/math]
such that
[math]f_1=1[/math] and[math]f_2=0[/math] if y=1 ( which means it’s in class 1),
and [math]f_1=0[/math] and [math]f_2=1[/math] if y=2 (which means it’s in class 2).
Eventually, we have our conditional density function formula
[math]f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}[/math]
To way to use this formula is, with given the training data (x(i), y(i)),to fit the data with [math]f(Y : X)[/math].
In general, we can think of the problem as having a box with some knobs. Inside the box is our objective function which gives the form to classify our input (xi) to our output (yi). The knobs in the box are functioning like the parameters of the objective function. Our job is to find the proper parameters that can minimize the error between our output and the true value. So we have turned our machine learning problem intoan optimization problem.
Since we need to find the parameters that maximize the chance of having our observed data coming from the distribution of f(x|parameter), we need to introduce Maximum Likelihood Estimation.
Maximum Likelihood Estimation
Given iid data points [math]({\mathbf{x}_i})_{i=1}^n[/math] (i.i.d) and density function [math]f(\mathbf{x}|\mathbf{\theta})[/math], where the form of f is known but the parameters [math]\theta[/math] are our unknown. For example, we know that the data come from a Gaussian distribution but we don't know the mean and variance of the distribution. The maximum likelihood estimation of [math]\theta\,_{ML}[/math] is a set of parameters that maximize the probability of observing [math]({\mathbf{x}_i})_{i=1}^n[/math] given [math]\theta\,_{ML}[/math].
[math]\theta_\mathrm{ML} = \underset{\theta}{\operatorname{arg\,max}}\ f(\mathbf{x}|\theta)[/math].
There was some discussion in class regarding the notation. In literature, Bayesians use [math]f(\mathbf{x}|\mu)[/math] while Frequentists use [math]f(\mathbf{x};\mu)[/math]. In practice, these two are equivalent.
Our goal is to find theta to maximize [math]\mathcal{L}(\theta\,) = f(\underline{\mathbf{x}}|\;\theta) = \prod_{i=1}^n f(\mathbf{x_i}|\theta)[/math]. where [math]\underline{\mathbf{x}}=\{x_i\}_{i=1}^{n}[/math] (The second equality holds because data points are iid.)
In many cases, it’s more convenient to work with the natural logarithm of the likelihood. (Recall that the logarithm preserves minumums and maximums.) [math]\ell(\theta)=\ln\mathcal{L}(\theta\,)[/math]
[math]\ell(\theta\,)=\sum_{i=1}^n \ln f(\mathbf{x_i}|\theta)[/math]
Applying Maximum Likelihood Estimation to [math]f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}[/math], gives
[math]\mathcal{L}(\mathbf{\beta\,})=\prod_{i=1}^n (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{1-y_i}[/math]
[math]\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) + (1-y_i) (\ln{1} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}))\right) \\[10pt]&{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - (1-y_i) \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) + y_i \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}[/math]
[math]\begin{align} {\frac{\partial \ell}{\partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}} \mathbf{x_i} \right) \\[8pt] & {}= \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(\mathbf{x_i} | \mathbf{\beta\,}) \mathbf{x_i}\right) \end{align}[/math]
[math]\frac{\partial \ell}{\partial \mathbf{\beta\,}}[/math] can be numerically solved by Newton’s Method.
Newton's Method
Newton's Method (or Newton-Raphson method) is a numerical method to find better approximations to the solutions of real-valued function. The function usually does not have an analytical form.
The goal is to find [math]\mathbf{x}[/math] such that [math] f(\mathbf{x}) = 0 [/math], such Xs are called the roots of function f. The recursion can be implemented by [math]\mathbf{x_1} = \mathbf{x_0} - \frac{f(\mathbf{x_0})}{f'(\mathbf{x_0})}.\,\! [/math].
It takes an initial guess [math]\mathbf{x_0}[/math] and the direction [math]\ \frac{f(x_0)}{f'(x_0)}[/math] that moves toward a better approximation. It then finds a newer and better [math]\mathbf{x_1}[/math]. Taking this [math]\mathbf{x_1}[/math] as [math]\mathbf{x_0}[/math] in the second run, it finds a newer and better [math]\mathbf{x_1}[/math] than the previous [math]\mathbf{x_1}[/math]. Repeating the same process, the [math]\mathbf{x_1}[/math] will be sufficiently accurate to the actual solutions.
Matlab Example
Below is the Matlab code to find a root of the function [math]\,y=x^2-2500[/math] from the initial guess of [math]\,x=90[/math]. The roots of this equation are trivially solved analytically to be [math]\,x=\pm 50[/math].
x=1:100; y=x.^2 - 2500; %function to find root of plot(x,y); x_opt=90; %starting guess x_traversed=[]; y_traversed=[]; error=[]; for i=1:6, y_opt=x_opt^2-2500; y_prime_opt=2*x_opt; %save results of each iteration x_traversed=[x_traversed x_opt]; y_traversed=[y_traversed y_opt]; error=[error abs(y_opt)]; %update minimum x_opt=x_opt-(y_opt/y_prime_opt); end hold on; plot(x_traversed,y_traversed,'r','LineWidth',2); title('Progressions Towards Root of y=x^2 - 2500'); legend('y=x^2 - 2500','Progression'); xlabel('x'); ylabel('y'); hold off; figure(); semilogy(1:6,error); title('Error vs Iteration'); xlabel('Iteration'); ylabel('Absolute Y Error');
In this example the Newton method converges to an optimum to within machine precision in only 6 iterations as can be seen from the plot of the Y deviate below.
File:newton error.png File:newton progression.png
Advantages of Logistic Regression
Logistic regression has several advantages over discriminant analysis:
- it is more robust: the independent variables don't have to be normally distributed, or have equal variance in each group
- It does not assume a linear relationship between the IV and DV
- It may handle nonlinear effects
- You can add explicit interaction and power terms
- The DV need not be normally distributed.
- There is no homogeneity of variance assumption.
- Normally distributed error terms are not assumed.
- It does not require that the independents be interval.
- It does not require that the independents be unbounded.
Comparison Between Logistic Regression And Linear Regression
Linear regression is a regression where both explanatory variable X and response variable Y are linear related. Both X and Y can be continuous variables, and for every one unit increase in the explanatory variable, there is a set increase or decrease in response variable Y. A closed form solution exists for the least squares estimate of [math]\beta[/math].
Logistic regression is a regression where both explanatory variable X and response variable Y are not linear related. The response variable provides the probability of occurrence of an event. X can be continuous but Y must be categorical variable (only has two outcomes, i.e. 0 or 1). For one every unit increase in the explanatory variable, there is a set increase or decrease in the probability of occurrence of the event. No closed form solution exists for the least squares estimate of [math]\beta[/math].
Newton-Raphson Method (Lecture: Oct 11, 2011)
Previously we had derivated the log likelihood function for the logistic function.
[math]\begin{align} L(\beta\,) = \prod_{i=1}^n \left( (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{y_i}(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{1-y_i} \right) \end{align}[/math]
After taking log, we can have:
[math]\begin{align} \ell(\beta\,) = \sum_{i=1}^n \left( y_i \log{\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}} + (1 - y_i) \log{\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}} \right) \end{align}[/math]
This implies that:
[math]\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}[/math]
Our goal is to find the [math]\beta\,[/math] that maximizes [math]{\ell(\mathbf{\beta\,})}[/math]. We use calculus to do this ie solve [math]{\frac{\partial \ell}{\partial \mathbf{\beta\,}}}=0[/math]. To do this we use the famous numerical method of Newton-Raphson. This is an iterative method where we calculate the first and second derivative at each iteration.
[Newton's Method:]
Here is how we usually implement Newton's Method: [math]\mathbf{x_1} = \mathbf{x_0} - \frac{f(\mathbf{x_0})}{f'(\mathbf{x_0})}.\,\!
[/math]. In our particular case, we look for x such that g'(x) = 0, and implement it by [math]\mathbf{x_1} = \mathbf{x_0} - \frac{f(\mathbf'{x_0})}{f''(\mathbf{x_0})}.\,\!
[/math].
In practice, the convergence speed depends on |F'(x*)|, where F(x) = [math]\mathbf{x} - \frac{f(\mathbf{x})}{f'(\mathbf{x})}.\,\![/math]. The smaller the |F'(x*)| is, the faster the convergence is.
[end]
The first derivative is typically called the score vector.
[math]\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}} \mathbf{x_i} \right) \\[8pt] \end{align}[/math]
[math]\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(x_i|\beta) \mathbf{x_i} \right) \\[8pt] \end{align}[/math]
where [math]\ P(x_i|\beta) = \frac{e^{\beta^T x_i}}{1+e^{\beta^T x_i}} [/math]
The negative of the second derivative is typically called the information matrix.
[math]\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})(1 - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) \right) \\[8pt] \end{align}[/math]
[math]\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) \right) \\[8pt] \end{align}[/math]
[math]\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (P(x_i|\beta))(1 - P(x_i|\beta)) \right) \\[8pt] \end{align}[/math]
again where [math]\ P(x_i|\beta) = \frac{e^{\beta^T x_i}}{1+e^{\beta^T x_i}} [/math]
We then use the following update formula to calcalute continually better estimates of the optimal [math]\beta\,[/math]. It is not typically important what you use as your initial estimate [math]\beta\,^{(1)}[/math] is.
[math] \beta\,^{(r+1)} {}= \beta\,^{(r)} + (I(\beta\,^{(r)}))^{-1} S(\beta\,^{(r)} )[/math]
Matrix Notation
Let [math]\mathbf{y}[/math] be a (n x 1) vector of all class labels. This is called the response in other contexts.
Let [math]\mathbb{X}[/math] be a (n x (d+1)) matrix of all your features. Each row represents a data point. Each column represents a feature/covariate.
Let [math]\mathbf{p}^{(r)}[/math] be a (n x 1) vector with values [math] P(\mathbf{x_i} |\beta\,^{(r)} ) [/math]
Let [math]\mathbb{W}^{(r)}[/math] be a (n x n) diagonal matrix with [math]\mathbb{W}_{ii}^{(r)} {}= P(\mathbf{x_i} |\beta\,^{(r)} )(1 - P(\mathbf{x_i} |\beta\,^{(r)} ))[/math]
The score vector, information matrix and update equation can be rewritten in terms of this new matrix notation, so the first derivative is
[math]\begin{align} S(\beta\,^{(r)}) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})\end{align}[/math]
And the second derivative is
[math]\begin{align} I(\beta\,^{(r)}) {}= -{\frac{\partial^{2} \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X} \end{align}[/math]
Therfore, we can fit a regression problem as follows
[math] \beta\,^{(r+1)} {}= \beta\,^{(r)} + (I(\beta\,^{(r)}))^{-1}S(\beta\,^{(r)} ) {}[/math]
[math] \beta\,^{(r+1)} {}= \beta\,^{(r)} + (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})[/math]
Iteratively Re-weighted Least Squares
If we reorganize this updating formula we can see it is really iteratively solving a least squares problem each time with a new weighting.
[math]\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}(\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X}\beta\,^{(r)} + \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)}))[/math]
[math]\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T\mathbb{W}^{(r)}\mathbf(z)^{(r)}[/math]
where [math] \mathbf{z}^{(r)} = \mathbb{X}\beta\,^{(r)} + (\mathbb{W}^{(r)})^{-1}(\mathbf{y}-\mathbf{p}^{(r)}) [/math]
Recall that linear regression by least squares finds the following minimum: [math]\ \min_{\beta}(y-X^T \beta)^T(y-X^T \beta)[/math]
Similarly, we can say that [math]\ \beta^{(r+1)}[/math] is the solution of a weighted least square problem in the new space of [math]\ \mathbf{z} [/math]:
[math]\beta^{(r+1)} \leftarrow arg \min_{\beta}(\mathbf{z}-X \beta)^T W (\mathbf{z}-X \beta)[/math]
Fisher Scoring Method
Fisher Scoring is a method very similiar to Newton-Raphson. It uses the expected Information Matrix as opposed to the observed information matrix. This distinction simplifies the problem and in perticular the computational complexity. To learn more about this method & logistic regression in general you can take Stat431/831 at the University of Waterloo.
Multi-class Logistic Regression
In a multi-class logistic regression we have K classes. For 2 classes K and l
[math]\frac{P(Y=l|X=x)}{P(Y=K|X=x)} = e^{\beta_l^T x}[/math]
We call [math]log(\frac{P(Y=l|X=x)}{P(Y=k|X=x)}) = (\beta_l-\beta_k)^T x[/math] , the log ratio of the posterior probabilities as the logit transformation. The decision boundary between the 2 classes is the set of points where the logit transformation is 0.
For each class from 1 to K-1 we then have:
[math]log(\frac{P(Y=1|X=x)}{P(Y=K|X=x)}) = \beta_1^T x[/math]
[math]log(\frac{P(Y=2|X=x)}{P(Y=K|X=x)}) = \beta_2^T x[/math]
[math]log(\frac{P(Y=K-1|X=x)}{P(Y=K|X=x)}) = \beta_{K-1}^T x[/math]
Note that choosing Y=K is arbitrary and any other choice is equally valid.
Based on the above the posterior probabilities are given by: [math]P(Y=k|X=x) = \frac{e^{\beta_k^T x}}{1 + \sum_{i=1}^{K-1}{e^{\beta_i^T x}}}\;\;for \; k=1,\ldots, K-1[/math]
[math] P(Y=K|X=x)=\frac{1}{1+\sum_{i=1}^{K-1}{e^{\beta_i^T x}}} [/math]
Logistic Regression Vs. Linear Discriminant Analysis (LDA)
Logistic Regression Model and Linear Discriminant Analysis (LDA) are widely used for classification. Both models build linear boundaries to classify different groups. Also, the categorical outcome variables (i.e. the dependent variables) must be mutually exclusive.
However, these two models differ in their basic approach. While Logistic Regression is more relaxed and flexible in its assumptions, LDA assumes that its explanatory variables are normally distributed, linearly related and have equal covariance matrices for each class. Therefore, it can be expected that LDA is more appropriate if the normality assumptions and equal covariance assumption are fulfilled in its explanatory variables. But in all other situations Logistic Regression should be appropriate.
Also, the total number of parameters to compute is different for Logistic Regression and LDA. If the explanatory variables have d dimensions and there are two classes to categorize, we need to estimate [math]\ d+1[/math] parameters in Logistic Regression (all elements of the d by 1 [math]\ \beta [/math] vector plus the scalar [math]\ \beta_0 [/math]) and the number of parameters grows linearly w.r.t. dimension, while we need to estimate [math]2d+\frac{d*(d+1)}{2}+2[/math] parameters in LDA (two mean values for the Gaussians, the d by d symmetric covariance matrices, and two priors for the two classes) and the number of parameters grows quadratically w.r.t. dimension.
Note that the number of parameters also corresponds to the minimum number of observations needed to compute the coefficients of each function. Techniques do exist though for handling high dimensional problems where the number of parameters exceeds the number of observations. Logistic Regression can be modified using shrinkage methods to deal with the problem of having less observations than parameters. When maximizing the log likelihood, we can add a [math]-\frac{\lambda}{2}\sum^{K}_{k=1}\|\beta_k\|_{2}^{2}[/math] penalization term where K is the number of classes. This resulting optimization problem is convex and can be solved using Newton-Raphson method as given in Zhu and hastie (2004). LDA involves the inversion of a d x d covariance matrix. When d is bigger than n (where n is the number of observations) this matrix has rank n < d and thus is singular. When this is the case, we can either use the pseudo inverse or perform regularized discriminant analysis which solves this problem. In RDA, we define a new covariance matrix [math]\, \Sigma(\gamma) = \gamma\Sigma + (1 - \gamma)diag(\Sigma)[/math] with [math]\gamma \in [0,1][/math]. Cross validation can be used to calculate the best [math]\, \gamma[/math]. More details on RDA can be found in Guo et al. (2006).
Because the Logistic Regression model has the form [math]log\frac{f_1(x)}{f_0(x)} = \beta{x}[/math], we can clearly see the role of each input variable in explaining the outcome. This is one advantage that Logistic Regression has other other classification methods and is why it is so popular in data analysis.
In terms of the performance speed, since LDA is non-iterative, unlike Logistic Regression which uses the iterative Newton-Raphson method, LDA can be expected to be faster than Logistic Regression.
Perceptron
The perceptron is the building block for neural networks. It was invented by Rosenblatt in 1957 at Cornell Labs, and first mentioned in the paper "The Perceptron - a perceiving and recognizing automaton". The perceptron is used on linearly separable data sets.
The LS computes a linear combination of factor of input and returns the sign.
For a 2 class problem, and a set of inputs with d features, a perceptron will use a weighted sum and it will classify the information using the sign of the result (i.e it uses a step function as it's activation function ). The figures on the right give an example of a perceptron. In these examples, [math]x^i[/math] is the i-th feature of a sample and [math]\beta_i[/math] is the i-th weight. [math]\beta_0[/math] is defined as the bias. The bias alters the position of the decision boundary between the 2 classes.
Perceptrons are generally trained using gradient descent. This type of learning can have 2 side effects:
- If the data sets are well separated, the training of the perceptron can lead to multiple valid solutions,
- If the data sets are not linearly separable, the learning algorithm will never finish.
Perceptrons are the simplest kind of a feedforward neural network. A perceptron is the building block for other neural networks such as:
- Multi-layer perceptron, which uses multiple layers of perceptron’s with nonlinear activation functions so that it can classify data that is not linearly separable.
- ADALINE, which differs from the standard perceptron by using the weighted sum (the net) to adjust the weights in the learning phase. The standard perceptron uses the output to adjust its weights (i.e. the net after it passed through the activation function).
- MADALINE, which is a two layer network using ADALINE nodes to process multiple inputs.
Perceptron Learning Algorithm (Lecture: Oct. 13, 2011)
Our goal is to find the best hyper-plane such that the distance between misclassified points and the hyper-plane is minimized. To achieve this, we define a cost function, [math]\phi(\boldsymbol{\beta}, \beta_0)[/math], as a summation of the distance between all misclassified points and the hyper-plane. To minimize this cost function, we need to estimate [math]\boldsymbol{\beta, \beta_0}[/math]. The logic is as follows:
1) Because a hyper-plane [math]\,L[/math] can be defined as
[math]\, L=\{x: f(x)=\beta^Tx+\beta_0=0\},[/math]
For any two arbitrary points [math]\,x_1 [/math] and [math]\,x_2 [/math] on [math]\, L[/math], we have
[math]\,\beta^Tx_1+\beta_0=0[/math],
[math]\,\beta^Tx_2+\beta_0=0[/math],
such that
[math]\,\beta^T(x_1-x_2)=0[/math].
where [math]\,\beta[/math] is orthogonal to the hyper-plane.
2) For any point [math]\,x_0[/math] in [math]\ L,[/math] [math]\,\;\;\beta^Tx_0+\beta_0=0[/math], which means [math]\, \beta^Tx_0=-\beta_0[/math].
3) We set [math]\,\beta^*=\frac{\beta}{||\beta||}[/math] as a norm vector of the hyper-plane[math]\, L[/math].
The sign distance of a point [math]\,x[/math] to[math]\ L[/math] is given by
[math]\,\beta^{*T}(x-x_0)=\beta^{*T}x-\beta^{*T}x_0 =\frac{\beta^Tx}{||\beta||}+\frac{\beta_0}{||\beta||} =\frac{(\beta^Tx+\beta_0)}{||\beta||}[/math]
Hence, [math]\,\beta^Tx+\beta_0[/math] is proportional to the distance of the point [math]\,x[/math] to the hyper-plane[math]\, L[/math].
4) The distance from a misclassified data point [math]\,x_i[/math] to the hyper-plane [math]\, L [/math] is
[math]\,-y_i(\boldsymbol{\beta}^Tx_i+\beta_0)[/math]
where [math]\,y_i[/math] is a target value, such that [math]\,y_i=1[/math] if [math]\boldsymbol{\beta}^Tx_i+\beta_0\lt 0[/math], [math]\,y_i=-1[/math] if [math]\boldsymbol{\beta}^Tx_i+\beta_0\gt 0[/math]
Note that the added negative sign in front of the [math]\,y_i[/math] in the above expression is to find the distance of the misclassified data points (i.e. when the sign of [math]\,y_i[/math] is different from [math]\boldsymbol{\beta}^Tx_i+\beta_0[/math]) instead of the correctly classified data points (i.e. when the sign of [math]\,y_i[/math] is the same as [math]\boldsymbol{\beta}^Tx_i+\beta_0[/math]).
To minimize the cost function [math]\phi(\boldsymbol{\beta}, \beta_0) = -\sum\limits_{i\in M} y_i(\boldsymbol{\beta}^Tx_i+\beta_0)[/math] where [math]\ M=\{\text {all points that are misclassified}\}[/math]
[math]\cfrac{\partial \phi}{\partial \boldsymbol{\beta}} = - \sum\limits_{i\in M} y_i x_i [/math] and [math] \cfrac{\partial \phi}{\partial \beta_0} = -\sum\limits_{i \in M} y_i[/math]
Using the gradient descent algorithm to solve these two equations, we have [math]\begin{pmatrix} \boldsymbol{\beta}^{\mathrm{new}}\\ \beta_0^{\mathrm{new}} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\beta}^{\mathrm{old}}\\ \beta_0^{\mathrm{old}} \end{pmatrix} + \rho \begin{pmatrix} y_i x_i\\ y_i \end{pmatrix}[/math]
where [math]\ \rho [/math] is the learning rate.
Since the data is linearly-separable, the solution is theoretically guaranteed to converge to a seperating hyperplane in a finite number of iterations. This number of iterations depends on the learning rate which has the initial value of:
[math]\begin{pmatrix} \beta^0\\ \beta_0^0 \end{pmatrix}[/math]
Note that we consider the offset term [math]\,\beta_0[/math] separately from [math]\ \beta[/math] to distinguish this formulation from those in which the direction of the hyperplane ([math]\ \beta[/math]) has been considered.
A major concern about gradient descent is that it may get trapped in local optimal solutions. Many works such as this paper by Cetin et al. and this paper by Atakulreka et al. have been done to tackle this issue.
Features
- When data is (linearly) separable, there are many solutions depending on the starting point.
- Even though convergence to a solution is guaranteed if the solution exists, the finite number of steps until convergence can be very large.
- The smaller the gap between the two classes, the longer the time of convergence.
- When the data is not separable, the algorithm will not converge.
- A learning rate that is too high will make the perceptron periodically oscillate around the solution unless additional steps are taken.
Separability and convergence
The training set D is said to be linearly separable if there exists a positive constant [math]\,\gamma[/math] and a weight vector [math]\,\beta[/math] such that [math]\,(\beta^Tx_i+\beta_0)y_i\gt \gamma [/math] for all [math]\,1 \lt i \lt n[/math]. That is, if we say that [math]\,\beta[/math] is the weight vector to the perceptron, then the output of the perceptron, , multiplied by the desired output of the perceptron, [math]\,y_i[/math], must be greater than the positive constant, [math]\,\gamma[/math], for all input-vector/output-value pairs [math]\,(x_i,y_i) [/math] in [math]\, D_n [/math].
Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data set is linearly separable. The idea of the proof is that the weight vector is always adjusted by a bounded amount in a direction that it has a negative dot product with, and thus can be bounded above by [math]O(\sqrt{t})[/math]where t is the number of changes to the weight vector. But it can also be bounded below by[math]\, O(t)[/math]because if there exists an (unknown) satisfactory weight vector, then every change makes progress in this (unknown) direction by a positive amount that depends only on the input vector. This can be used to show that the number t of updates to the weight vector is bounded by [math] (\frac{2R}{\gamma} )^2[/math] , where R is the maximum norm of an input vector.<ref>http://en.wikipedia.org/wiki/Perceptron</ref>
Feed-Forward Neural Networks
- The term 'neural networks' is used because historically, it was used to describe the processes of the brain (e.g. synapses).
- A neural network is a multistate regression model which is typically represented by a network diagram (see right).
- The feedforward neural network was the first and arguably simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.<ref>http://en.wikipedia.org/wiki/Feedforward_neural_network</ref>
- Tor regression, typically k = 1 (the number of nodes in the last layer), there is only one output unit [math]y_1[/math] at the end.
- Tor c-class classification, there are typically c units at the end with the cth unit modelling the probability of class c, each [math]y_c[/math] is coded as 0-1 variable for the cth class.
- Neural networks are known as universal approximators, where a two-layer feed-forward neural network can approximate any continuous function to an arbitrary accuracy (assuming sufficient hidden nodes exist and that the necessary parameters for the neural network can be found).<ref>C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006</ref>
Finding the Optimal Weights
There are many algorithms for calculating the weights in a feed-forward neural network. One of the most used approaches is the backpropagation algorithm. The application of the backpropagation algorithm for neural networks was popularized in the 1980's by researchers like Rumelhart, Hinton and McClelland (even though the backpropagation algorithm had existed before then). <ref>S. Seung, "Multilayer perceptrons and backpropagation learning" class notes for 9.641J, Department of Brain & Cognitive Sciences, MIT, 2002. Available: [2] </ref>
For the backpropagation algorithm, we consider three hidden layers of nodes
where [math]\ l[/math] represents the column of nodes in the first column,
[math]\ i[/math] represents the column of nodes in the second column, and
[math]\ k[/math] represents the column of nodes in the third column.
We want the output of the feed forward neural network [math]\hat{y}[/math] to be as close to the known target value [math]\ y [/math] as possible. Thus, we want to minimize [math](\left| y- \hat{y}\right|)^2[/math].
Instead of the sign function that has not derivative we use the so called logistic function (a smoothed form of the sign function):
[math] \sigma(a)=\frac{1}{1+e^{-a}} [/math]
"Notice that if σ is the identity function, then the entire model collapses to a linear model in the inputs. Hence a neural network can be thought of as a nonlinear generalization of the linear model, both for regression and classification."<ref>Friedman, J., Hastie, T. and Tibshirani, R. (2008) “The Elements of Statistical Learning”, 2nd ed, Springer.</ref>
Logistic function is a common sigmoid curve .It can model the S-curve of growth of some population [math] \sigma[/math]. The initial stage of growth is approximately exponential; then, as saturation begins, the growth slows, and at maturity, growth stops.
To solve the optimization problem, we take the derivative with respect to weight [math]u_{il}[/math]:
[math]\cfrac{\partial \left|y- \hat{y}\right|^2}{\partial u_{il}} = \cfrac{\partial \left|y- \hat{y}\right|^2}{\partial a_j} \cdot \cfrac{\partial a_j}{\partial u_{il}}[/math] by Chain rule
[math]\cfrac{\partial \left|y- \hat{y}\right|^2}{\partial u_{il}} = \delta_j \cdot [/math] [to be inserted after the Oct 18th lecture]
where [math] \delta_j = \cfrac{\partial \left|y- \hat{y}\right|^2}{\partial a_j} [/math] which will be computed recursively.
[math]a_j = z_i \cdot u_{ji}[/math]
[math]a_j = \sigma(a_i) \cdot u_{ji} [/math] since [math] z_i = \sigma(a_i)[/math]
Continue back propagation Oct 18, 2011 lecture
The goal is to find the derivative of [math]\left|y - \hat{y}\right|^2[/math] with respect to each weight in the hidden layers as shown in fig 1. By using the chain rule we get
[math] \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{il}} = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_i}\cdot \cfrac{\partial a_i}{\partial u_{il}} = \delta_{i}z_l [/math]
We develop recursive definition ([math]\delta_i[/math] in terms of [math]\delta_j[/math]) because [math]\delta_i[/math] is uknown.
[math]
\delta_i = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_i}
= \sum_{j} \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j}\cdot
\cfrac{\partial a_j}{\partial a_i} [/math]
[math]\delta_i =\sum_{j}\delta_j\cdot\cfrac{\partial a_j}{\partial z_i}\cdot\cfrac{\partial z_i}{\partial a_i} = \sum_{j} \delta_j\cdot u_{ji} \cdot \sigma'(a_i)[/math]
[math]\delta_i= \sigma'(a_i)\sum_{j}\delta_j \cdot u_{ji}[/math].
Let's define [math]\delta_k[/math] as:
[math]\delta_k= \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_k}[/math].
However, [math]\,a_k = \hat{y}[/math] because an activation function is not applied in the output layer. So, our calculation becomes
[math]\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial \hat{y}} = -2(y - \hat{y})[/math]
The network weights are updated using the back-propagation algorithm when each training data point [math]\underline{x}[/math]is fed into the FFNN. This update procedure is done using the following steps:
- First arbitrarily choose some random weights (preferably close to zero) for your network.
- Apply [math]\underline{x}[/math] to the FFNN's input layer, and calculate the outputs of all input neurons.
- Propagate the outputs of each hidden layer forward, one hidden layer at a time, and calculate the outputs of all hidden neurons.
- Once [math]\underline{x}[/math] reaches the output layer, calculate the output(s) of all output neuron(s) given the outputs of the previous hidden layer.
At the output layer, compute [math]\,\delta_k = -2(y_k - \hat{y}_k)[/math] for each output neuron(s), then compute [math]\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{il}} = \delta_{j}z_l[/math] for all weights [math]\,u_{il}[/math], and then update [math]u_{il}^{\mathrm{new}} \leftarrow u_{il}^{\mathrm{old}} - \rho \cdot \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{il}} [/math] for all weights [math]\,u_{il}[/math].
- Continue for next data points until weights converge.
Limitations
- The convergence obtained from backpropagation learning is very slow.
- The convergence in backpropagation learning is not guaranteed.
- The result may generally converge to any local minimum on the error surface, since stochastic gradient descent exists on a surface which is not flat.
- Backpropagation learning requires input scaling or normalization. Inputs are usually scaled into the range of +0.1f to +0.9f for best performance.<ref>http://en.wikipedia.org/wiki/Backpropagation</ref>
Model selection
One important question to be asked here
How do we choose a good classifier ?
We would like to have a classifier that minimizes the true error rate [math]\ L(h)[/math].
[math]\ L(h)=Pr\{h(x)\neq y\}[/math]
For any type of model, we want to have a classifier that does not have zero error rate to avoid overfitting and also a classifer that does not have a high error rate (too much misclassificatio).
The empirical true error rate is
[math]\ \hat L(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})[/math]
which can not be used because it is biased downward.
In ordee to avoid overfitting, there are two main stratigies:
- Estimate the error rate
- Cross-validation
- Computing error bound ( probability in-equality )
- Regulazition
We basically make the function (model) smooth by limiting the complexity or by limiting the size of weights.
Cross validation
Cross-validation is an approach for avoiding over-fitting while modelling data that bases the choice of model parameters on a portion of the training set, while using the rest of the set for validation, i.e., some of the data is left out when fitting the model. One round of the process involves partitioning the data set into two complementary subsets, fitting the model to one subset (called the training set), and testing the model against the other subset (called the validation or testing subset). This is usually repeated several times using different partitions in order to reduce variability, and the validation results are then averaged over the rounds.
A common type of cross-validation that is used for relatively small data sets is K-fold cross-validation, the algorithm for which can be stated as follows:
Let h denote a classification model to be fitted to a given data set.
- Randomly partition the original data set into K subsets of approximately the same size. A common choice for K is K = 10.
- For k = 1 to K do the following
- Remove subset k from the data set
- Estimate the parameters of the classification model based only on the remaining data points. Denote the resulting function by h(k)
- Use h(k) to predict the data points in subset k. Denote by [math]\begin{align}\hat L_k(h)\end{align}[/math] the observed error rate.
- Compute the average error [math]\hat L(h) = \frac{1}{K} \sum_{k=1}^{K} \hat L_k(h)[/math]
The best classifier is the model that results in the lowest average error rate.
A common variation of k-fold cross-validation uses a single observation from the original sample as the validation data, and the remaining observations as the training data. This is then repeated such that each sample is used once for validation. It is the same as a K-fold cross-validation with K being equal to the number of points in the data set, and is referred to as leave-one-out cross-validation.
References
<ref> stat.psu.edu/~jiali/course/stat597e/notes2/percept.pdf </ref> <references />