Most people are familiar with the concept of matrix eigenvalues. Less well known is that this concept can be fruitfully expanded to the generalized eigenvalues of pairs of matrices. Closely related are matrix trace optimization problems, which extremize the trace of certain matrix products. Trace optimization constitutes a large class of practically solvable non-convex optimization problems, commonly useful for dimensionality reduction, which includes unsupervised (weighted) principal component analysis and the supervised method of Fisher’s Linear Discriminant. The purpose of this post is to explore some of these problems, their intuition, and their applications.
Introduction
The eigenvalues of a matrix are profoundly important quantities and are usually encountered early on in one’s mathematical career. The concept of the eigenvalues of a single matrix \(A\) can be expanded to the generalized eigenvalues of a pair of matrices \((A, B)\). My purpose here is to summarize and distill some of what I’ve learned about the theory and application of generalized eigenvalues.
Spectral Theory
Before discussing eigenvalue problems, we need to set up some basic notation and review critical facts about matrices.
Consider an \(n \times n\) matrix \(A \in \R^{n \times n}\). The eigenvalues \(\lambda_i \in \mathbb{C}\) and eigenvectors \(v_i \in \mathbb{C}^n\) of the matrix \(A\) are those (complex!) scalars and vectors such that \[Av_i = \lambda_i v_i.\] That is, the eigenvectors are the “directions” in which the operation of the matrix \(A\) is “simple”, i.e., nothing but multiplication by a scalar. It is one of the hallmarks of linear algebra that the eigenvectors of an \(n \times n\) matrix will often (though not always) provide us with a basis for \(\mathbb{R}^n\), i.e., we can use the eigenvectors as coordinate axes. This is incredibly useful when working with the matrix \(A\), since the matrix does nothing but scale the axes in this basis, a rather simple operation. When this is possible, we say that the matrix is diagonalized by the change of basis. Conditions for matrix diagonalizability are captured by the Spectral Theorem, so called because the collection of eigenvalues of a matrix is often referred to as “the spectrum” (of the matrix).
The Spectral Theorem
To diagonalize the matrix \(A\) means that we can re-write it as \(A = V \Lambda V^{-1}\) where \(\Lambda = \mathsf{Dg}(\mathbf{\lambda})\) is a diagonal matrix of the eigenvalues of \(A\), and \(V = \begin{bmatrix}v_1 & v_2 & \cdots & v_n \end{bmatrix} \in \mathbb{C}^{n \times n}\) collects the eigenvectors. The matrix \(V^{-1}\) represents a transformation into an eigenbasis and \(V\) is a transformation out of it. Unfortunately, diagonalizing a matrix is not always possible, and matrix diagonalizability is quite a slippery issue. Fortunately though, there is a simple criterion, occurring frequently for matrices of interest in practice, that guarantees that \(A\) is diagonalizable. The condition is that \(A\) be real-valued and symmetric: \(A = A^{\mathsf{T}}\) (more generally, \(A\) can be normal: \(AA^{\mathsf{T}} = A^{\mathsf{T}} A\)).
Spectral Theorem: Let $A \in \R^{n \times n}$ be a symmetric matrix. Then, there exist $n$ real eigenvalues $\lambda_i$ of $A$ and $n$ eigenvectors $v_i \in \R^n$ such that $A v_i = \lambda_i v_i$. Moreover, the vectors $v_i$ constitute an orthonormal basis of $\R^n$.
If we express \(x\) in the eigenbasis as \(x = \sum_{i = 1}^n \langle v_i, x \rangle v_i\) then the matrix-vector product is \(Ax = \sum_{i = 1}^n \lambda_i \langle v_i, x \rangle v_i\), and in matrix notation this is simply \(Ax = V\Lambda V^{\mathsf{T}}x\) where \(V = \begin{bmatrix}v_1\ \cdots v_n \end{bmatrix}\). In fact, this shows that \(A = V\Lambda V^{\mathsf{T}}\), and since \(V\) is an orthonormal basis we have \(V^{\mathsf{T}}V = I_n = VV^{\mathsf{T}}\) meaning that \(V^{-1} = V^{\mathsf{T}}\). So the eigenvectors of \(A\) diagonalize \(A\) itself.
The spectral theorem is one of the most important results in mathematics and generalizes in various ways to certain sorts of infinite dimensional operators.
Remark: Conventionally, the eigenvalues are sorted from low to high as in $\lambda_1(A) \le \cdots \le \lambda_n(A)$ (the opposite sorting convention is also common).
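As a quick sanity check of the theorem (and of the sorting convention), here is a minimal NumPy/SciPy sketch; the matrix \(A\) below is an arbitrary symmetric example of my own choosing.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T                      # symmetrize to guarantee real eigenvalues

lam, V = eigh(A)                 # eigh returns eigenvalues in ascending order
print(lam)                                       # lambda_1 <= ... <= lambda_n
print(np.allclose(V @ np.diag(lam) @ V.T, A))    # A = V Lambda V^T
print(np.allclose(V.T @ V, np.eye(4)))           # orthonormal eigenvectors
```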
The Courant-Fischer Variational Characterization
The Courant-Fischer theorem provides a variational characterization of the eigenvalues of a symmetric matrix. Basically, a variational characterization is a way to calculate some quantity in terms of an optimization problem, the term “variational” presumably arising from the Calculus of Variations , and the idea of an infinitesimal “variation” of a function. Specifically, the largest eigenvalue \(\lambda_n(A)\) of an \(n \times n\) symmetric matrix can be characterized by the optimization problem
\begin{equation} \begin{aligned} \underset{x \in \R^n}{\text{maximize}}&\ x^{\mathsf{T}} A x\\ \text{subject to}&\ ||x||_2 = 1, \end{aligned}\tag{$\lambda_n(A)$} \end{equation}
where the maximum is attained by an eigenvector corresponding to the largest eigenvalue \(\lambda_n\), which is also the value of the problem. Variational characterizations are incredibly useful in practice – they often provide one or both of: (a) a concrete procedure for calculating some quantity by means of an optimization algorithm or (b) a way to solve seemingly difficult optimization problems by directly calculating the quantity known to be the solution. The optimization problem I called \((\lambda_n(A))\) above falls into the second category. Particularly since it is not a convex problem, directly applying an iterative optimization algorithm to this problem is not likely to pan out well, but there do exist reliable algorithms for calculating matrix eigenvalues.
To see why the problem \((\lambda_n(A))\) provides us with the smallest eigenvalue, we recognize that it is exactly equivalent to the following:
\begin{equation} \begin{aligned} \underset{u \in \R^n}{\text{maximize}}&\ u^{\mathsf{T}} \Lambda u\\ \text{subject to}&\ ||u||_2 = 1, \end{aligned}\notag \end{equation}
where \(\Lambda\) is the diagonal matrix of eigenvalues from \(A = V\Lambda V^{\mathsf{T}}\). This equivalence follows since \(V\) is an orthonormal basis: for any \(x\) with \(||x||_2 = 1\) there exists a \(u\) such that \(x = Vu\) and \(||u||_2 = 1\), since \(||Vu||_2 = \sqrt{\lang Vu, Vu\rang} = \sqrt{\lang u, V^{\mathsf{T}} Vu\rang} = ||u||_2\). The optimizer of this latter problem is clearly \(u^\star = (0, 0, \ldots, 1)\) (selecting out the largest eigenvalue) with the value \(\lambda_n\), and the corresponding solution of the original problem is \(x = Vu^\star = v_n\), an eigenvector associated with \(\lambda_n\).
Remark: Incidentally, this also shows that $\lambda_n(A)$ is a convex function of \(A\) since \(A \mapsto x^{\mathsf{T}} A x\) is a linear function of the matrix $A,$ and the maximum over a family of linear functions is a convex function. Similarly, it can be shown that \(\lambda_1(A)\) is a concave function of $A,$ and can be obtained by replacing “maximize” with “minimize” in the variational problem.
The Courant-Fischer theorem takes this one step further by providing a variational characterization of any of the eigenvalues. Specifically,
Theorem (Courant-Fischer): Let \(A \in \R^{n \times n}\) be a real symmetric matrix. Then
\begin{equation} \lambda_k = \underset{U}{\text{min}}\ \Bigl\{\underset{x \in \R^n}{\text{max}}\ x^{\mathsf{T}} A x\ \big|\ ||x||_2 = 1, x \in U, \mathsf{dim}(U) = k \Bigr\},\notag \end{equation}
where the minimization is over subspaces of \(\R^n\).
Intuitively, if \(x\) can live in a \(k\)-dimensional space, then the maximization over \(x\) can always attain at least the value of the \(k^{\text{th}}\) smallest eigenvalue, no matter which \(k\)-dimensional space it is confined to. From the optimization problem described earlier, the maximizing \(x\) is given by the eigenvector corresponding to the largest eigenvalue among the directions available within the space \(U\). The minimizing space \(U\) is then of course spanned by the eigenvectors corresponding to the \(k\) smallest eigenvalues, \(U = \mathsf{span}\{v_1, \ldots, v_k\}\).
To finally see the connection with trace optimization problems, it follows from the Courant-Fischer theorem that the value of the optimization problem (generalizing the first variational characterization above)
\begin{equation} \begin{aligned} \underset{U \in \R^{n \times k}}{\text{minimize}}&\ \mathsf{tr}\ U^{\mathsf{T}} A U\\ \text{subject to}&\ U^{\mathsf{T}} U = I \end{aligned}\tag{${\mathcal{E}_k}$} \end{equation}
is given by \(\lambda_1 + \cdots + \lambda_k\), and that an optimizing \(U = \begin{bmatrix}v_1\ \cdots\ v_k \end{bmatrix}\) is given by the \(n \times k\) matrix formed from the first \(k\) eigenvectors. When reading this problem, be careful to remember that \(U^{\mathsf{T}} U = I_k\) definitely does not imply \(UU^{\mathsf{T}} = I_n\), unless \(n = k\).
This optimization problem can be obtained from the Courant-Fischer theorem more-or-less by simply taking the summation of the first \(k\) eigenvalues and stacking separate “\(x\)” optimization variables together into a matrix, along with elementary facts about traces.
Main Point: The main point I am making here is that Problem $(\mathcal{E}_k),$ (parameterized by the number \(k\) of vectors in \(U\)) which is a non-convex optimization problem, can be solved by computing an eigendecomposition of the symmetric matrix $A.$ This is remarkable, since there exist efficient algorithms for computing this factorization.
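A small sketch of this claim, with a random symmetric \(A\) of my own choosing: the value of \((\mathcal{E}_k)\) is the sum of the \(k\) smallest eigenvalues, and any other feasible \(U\) does no better.

```python
import numpy as np
from scipy.linalg import eigh, qr

rng = np.random.default_rng(1)
n, k = 6, 2
M = rng.standard_normal((n, n))
A = M + M.T

lam, V = eigh(A)                        # ascending eigenvalues
U_star = V[:, :k]                       # first k eigenvectors solve (E_k)
print(np.trace(U_star.T @ A @ U_star))  # equals lam[0] + lam[1]
print(lam[:k].sum())

# any other U with orthonormal columns achieves a larger (or equal) trace
U_rand, _ = qr(rng.standard_normal((n, k)), mode='economic')
print(np.trace(U_rand.T @ A @ U_rand) >= lam[:k].sum())
```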
Generalized Eigenvalues
The eigenvalue problem of finding a pair \((\lambda, v)\) for a matrix \(A\) such that \(Av = \lambda v\) can be generalized in various ways. One particular direction is to consider a generalized eigenvalue problem (GEVP) associated to a pair of matrices \((A, B)\) wherein we seek to find \((\lambda, v)\) such that \(Av = \lambda Bv\). Such pairs are now referred to as generalized eigenpairs associated with \((A, B)\).
The case of most interest in applications is where \(A\) is symmetric, and \(B \succ 0\) is positive definite. Recall that positive definite matrices are characterized by the fact that they admit a Cholesky factorization \(B = LL^{\mathsf{T}}\) where \(L\) is an invertible lower-triangular matrix – a particular type of matrix square root. Using this Cholesky decomposition, the GEVP can be reduced to an ordinary EVP:
\begin{equation} \begin{aligned} Av &= \lambda Bv\\ \iff A L^{-\mathsf{T}} L^{\mathsf{T}} v &= \lambda LL^{\mathsf{T}} v\\ \iff \bigl(L^{-1} A L^{-\mathsf{T}}\bigr) w &= \lambda w;\ w = L^{\mathsf{T}} v, \end{aligned}\notag \end{equation}
wherein \(S \overset{\Delta}{=} L^{-1} A L^{-\mathsf{T}}\) is a symmetric matrix with eigenpairs \((\lambda, w)\) from which the generalized eigenvectors \(v = L^{-\mathsf{T}} w\) can be recovered. (Remark: I avoid the computation $B^{-1} A$ since this need not be symmetric.)
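Here is a sketch of that reduction, checked against scipy.linalg.eigh’s direct handling of the pair \((A, B)\); the matrices are arbitrary examples.

```python
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
A = M + M.T                            # symmetric
M2 = rng.standard_normal((n, n))
B = M2 @ M2.T + n * np.eye(n)          # positive definite

L = cholesky(B, lower=True)            # B = L L^T
# S = L^{-1} A L^{-T}, formed with triangular solves rather than explicit inverses
S = solve_triangular(L, solve_triangular(L, A, lower=True).T, lower=True)
lam, W = eigh(S)                       # ordinary symmetric EVP
V = solve_triangular(L.T, W, lower=False)        # v = L^{-T} w

lam_direct, _ = eigh(A, B)             # the generalized EVP solved directly
print(np.allclose(lam, lam_direct))
print(np.allclose(A @ V, B @ V @ np.diag(lam)))  # A v = lambda B v
```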
To understand the intuition of this situation, we first use the fact that the eigenvectors \(W = \begin{bmatrix} w_1\ \cdots\ w_n \end{bmatrix}\) of \(S\) form an orthonormal basis for \(\R^n\). Thus, since \(W = L^{\mathsf{T}} V\) we must have
\begin{equation} \begin{aligned} W^{\mathsf{T}} W &= I\\ \implies V^{\mathsf{T}} LL^{\mathsf{T}} V &= I\\ \implies V^{\mathsf{T}} B V &= I. \end{aligned}\notag \end{equation}
Thus, it becomes natural to construct the associated trace optimization problems for the pair \((A, B)\)
\begin{equation} \begin{aligned} \underset{U \in \R^{n \times k}}{\text{minimize}}&\ \mathsf{tr}\ U^{\mathsf{T}} A U\\ \text{subject to}&\ U^{\mathsf{T}} B U = I. \end{aligned}\tag{${\mathcal{G}_k}$} \end{equation}
To reiterate, the point here is that this is a class of non-convex optimization problems which can be efficiently solved by eigenvalue algorithms. Indeed, the value of this problem is given by \[\lambda_1(A, B) + \cdots + \lambda_k(A, B),\] the sum of the first \(k\) generalized eigenvalues, and the optimizing matrix \(U \in \R^{n \times k}\) constitutes the first \(k\) generalized eigenvectors of \((A, B)\). Keep in mind that the problem is only well defined when \(A\) is symmetric, and \(B\) is positive-definite (though a reasonable solution for the semidefinite case can probably still be worked out). If these assumptions do not hold, the situation becomes substantially more complicated.
Interpretations and Intuition
The generalized eigenvectors \(V = \begin{bmatrix}v_1\ \cdots\ v_n \end{bmatrix}\) are still a basis for \(\R^n\) (since \(L\) is invertible) but they are no longer orthogonal. Instead, they are orthogonal with respect to an inner product modified by the matrix $B.$ That is, if we write \(x^{\mathsf{T}} y \overset{\Delta}{=} \lang x, y \rang\) then we might generalize this by writing \[\lang x, y \rang_{B^{-1}} = x^{\mathsf{T}} B y = \lang L^{\mathsf{T}} x, L^{\mathsf{T}} y \rang,\] where the inner product \(\lang \cdot, \cdot \rang_{B^{-1}}\) (which, recall, describes the geometry of the space through the relationship to angles \(||x|| ||y|| \mathsf{cos}\ \theta_{xy} = \lang x, y \rang\) and distances \(||x|| = \sqrt{\lang x, x \rang}\)) has modified the geometry of \(\R^n\). Using a subscript \(B^{-1}\), rather than just \(B\) for this inner product is, I believe, the convention.
We’ve now equipped \(\R^n\) with two different inner products – the ordinary one \(\langle x, y \rangle = x^{\mathsf{T}}y\), and the new one \(\langle x, y \rangle_{B^{-1}} = x^{\mathsf{T}} B y\). To gain some intuition for this new inner product, consider the ordinary Euclidean ball \(\mathbb{B} = \{x \in \R^n\ |\ ||x||_2 \le 1\}\) and the ball in the norm \(||x||_{B^{-1}} = \sqrt{x^{\mathsf{T}} B x}\), namely \[\mathbb{B}_B = \{x \in \R^n\ |\ ||x||_{B^{-1}} \le 1\},\] which is an ellipsoid. Since if \(y = L^{-\mathsf{T}} x\) and \(||x||_2 \le 1\) then \(||L^{\mathsf{T}} y||_2 = ||y||_{B^{-1}} \le 1\), the mapping \(x \mapsto L^{-\mathsf{T}}x\) transforms the ordinary Euclidean ball into \(\mathbb{B}_B\) as in \(L^{-\mathsf{T}} \mathbb{B} = \mathbb{B}_B\).
To visualize the ball \(L^{-\mathsf{T}} \mathbb{B}\), recognize that if \(u_i\) is an (ordinary) eigenvector of \(B\), then \(u_i^{\mathsf{T}} B u_i = \lambda_i(B)\) and therefore any vector proportional to \(u_i\) with a Euclidean length between \(0\) and \(1 / \sqrt{\lambda_i(B)}\) will be inside the ball. Essentially what this means is that the eigenvectors \(u_i\) of \(B\) define the principal axes of the ellipsoid, each of which has length \(1 / \sqrt{\lambda_i(B)}\). After stretching the space through the mapping \(x \mapsto L^{-\mathsf{T}}x\), the directions associated with large eigenvalues of \(B\) are shortened, and directions associated with small eigenvalues are lengthened. Another way to think about it is through a topographic map – the terrain in the directions of large eigenvalues is very steep (the contours are tightly packed along the short axes of the ellipse).
The figure above depicts the transformation of Euclidean space \(\R^n\) by the mapping \(x \mapsto L^{-\mathsf{T}}x\). The dotted lines are points in the original space \(\R^n\), and the solid lines are the result of applying the transformation. The ellipse in particular includes eigenvectors of \(B\) (as calculated by scipy.linalg.eigh) scaled by \(1 / \sqrt{\lambda_i(B)}\) in white. The contours are drawn for the function \(x \mapsto \frac{1}{2}x^{\mathsf{T}} B x\) and I’ve used the particular matrix \[B = \begin{bmatrix}5 & 1 \\ 1 & 1 \end{bmatrix},\] which has an eigendecomposition \[B \approx \begin{bmatrix}0.230 & -0.973 \\ -0.973 & -0.230 \end{bmatrix}\begin{bmatrix}0.764 & 0 \\ 0 & 5.236 \end{bmatrix}\begin{bmatrix}0.230 & -0.973 \\ -0.973 & -0.230 \end{bmatrix}^{\mathsf{T}}.\]
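The quoted decomposition can be re-derived in a couple of lines (a small check, nothing more):

```python
import numpy as np
from scipy.linalg import eigh

B = np.array([[5.0, 1.0],
              [1.0, 1.0]])
lam, U = eigh(B)            # ascending: approximately [0.764, 5.236]
print(lam)
print(U)                    # columns are the eigenvectors (up to sign)
print(1 / np.sqrt(lam))     # semi-axis lengths of the ellipsoid above
```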
Applications and Computations
Practically speaking, a reliable way of solving generalized eigenvalue problems is via scipy.linalg.eig (which will try to solve \(Av = \lambda B v\) for any square \(A, B\)) and scipy.linalg.eigh (which assumes \(A\) is symmetric and \(B\) is positive definite, and always returns a real factorization). These functions also provide the option to (efficiently) compute and return only a subset of eigenvalues (along with corresponding eigenvectors). Moreover, this subset can be defined either by indices (e.g., return the third, fourth, and fifth) or by values (e.g., return eigenvalues between \(0\) and \(1\)). These are incredibly useful features in practice.
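For concreteness, here is a sketch of these calls; the matrices are arbitrary placeholders and the subset keywords are those provided by recent SciPy versions.

```python
import numpy as np
from scipy.linalg import eig, eigh

rng = np.random.default_rng(3)
n = 6
M = rng.standard_normal((n, n))
A = M + M.T                            # symmetric
M2 = rng.standard_normal((n, n))
B = M2 @ M2.T + n * np.eye(n)          # positive definite

w, v = eig(A, B)                       # general pair: eigenvalues may be complex
w_real, V = eigh(A, B)                 # symmetric-definite pair: real, ascending

# only the three largest generalized eigenvalues (indices are inclusive)
w_top, V_top = eigh(A, B, subset_by_index=[n - 3, n - 1])

# only the strictly positive generalized eigenvalues
w_pos, V_pos = eigh(A, B, subset_by_value=[0, np.inf])
```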
Principal Component Analysis
The most famous application of trace optimization problems is the feature-extraction or dimensionality-reduction technique known as Principal Component Analysis (PCA). Suppose we have a random variable \(X \in \R^n\) with mean zero \(\mathbb{E}[X] = 0\) and variance \(\mathbb{V}[X] = \Sigma \succ 0\). The goal of PCA is to find a matrix \(V \in \R^{n \times k}\) having orthonormal columns (so that \(V^{\mathsf{T}} V = I\)) and such that the total variance of \(Y = V^{\mathsf{T}} X\) is maximized. The number \(k\) of columns of \(V\) is the key here – we will usually choose \(k \ll n\) so that the idea of maximizing the variance corresponds to searching for the \(k\) “most explanatory” directions in \(\R^n\). Another way to think about this is that we are finding a subspace of dimension \(k\) which “explains” the greatest amount of variability in \(X\).
It is easy to set this problem up as a trace optimization problem. The total variance of \(Y\) (which is necessarily also zero mean) is given by
\begin{equation} \begin{aligned} \mathbb{E} Y^{\mathsf{T} }Y &= \mathbb{E} \mathsf{tr}\ YY^{\mathsf{T}}\\ &= \mathbb{E} \mathsf{tr}\ V^{\mathsf{T}} X X^{\mathsf{T}} V\\ &= \mathsf{tr}\ V^{\mathsf{T}} \Sigma V.\\ \end{aligned}\notag \end{equation}
Thus, the problem comes down to solving:
\begin{equation} \begin{aligned} \underset{V \in \R^{n \times k}}{\text{maximize}}&\ \mathsf{tr}\ V^{\mathsf{T}} \Sigma V\\ \text{subject to}&\ V^{\mathsf{T}} V = I, \end{aligned}\notag \end{equation}
which is a standard trace optimization problem whose solution is, famously, given by the \(k\) eigenvectors corresponding to the \(k\) largest eigenvalues of \(\Sigma\).
Remark: Given an actual (centered) data matrix $X \in \R^{N \times n}$, one might naturally think to explicitly form the covariance estimate $\widehat{\Sigma} = \frac{1}{N} X^{\mathsf{T}} X$. However, this is usually a mistake: the eigendecomposition of $\widehat{\Sigma}$ can be computed directly from a Singular Value Decomposition of the matrix $X$ itself, which is numerically better behaved (forming $X^{\mathsf{T}} X$ squares the condition number). Indeed, for most purposes I am aware of, the explicit computation of $X^{\mathsf{T}} X$ is unnecessary and should not be performed in practice.
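A sketch of that equivalence with a placeholder (already centered) data matrix: the right singular vectors of \(X\) are the eigenvectors of \(\widehat{\Sigma}\), and the squared singular values divided by \(N\) are its eigenvalues.

```python
import numpy as np
from scipy.linalg import eigh, svd

rng = np.random.default_rng(4)
N, n, k = 500, 10, 3
X = rng.standard_normal((N, n)) @ rng.standard_normal((n, n))
X = X - X.mean(axis=0)                 # center the data

# route 1: form the covariance explicitly (avoid this in practice)
Sigma_hat = X.T @ X / N
lam, V = eigh(Sigma_hat)               # ascending
V_cov = V[:, -k:]                      # top-k principal directions

# route 2: SVD of the data matrix itself
U, s, Vt = svd(X, full_matrices=False)
V_svd = Vt[:k].T                       # rows of Vt are ordered by decreasing s
print(np.allclose(s**2 / N, lam[::-1]))
# the two spanned subspaces coincide (individual vectors may differ in sign/order)
print(np.allclose(V_cov @ V_cov.T, V_svd @ V_svd.T))
```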
Weighted Principal Component Analysis
The choice in the last section of maximizing the total variance of \(Y = V^{\mathsf{T}} X\) was arbitrary. There is a natural generalization of this formulation where we still find a matrix \(V\) to compute a projection \(Y = V^{\mathsf{T}} X\) as usual, but now we weight the “importance” of the various components of \(X\) by some matrix \(W\) (which is often diagonal, though this is not a requirement). A natural objective is then the quantity \(\mathbb{E}||V^{\mathsf{T}} W^{1/2} X||_2^2 = \mathsf{tr}\ V^{\mathsf{T}} W^{1/2} \Sigma W^{1/2} V\). This objective can be plugged into another trace optimization problem with constraint \(V^{\mathsf{T}} V = I\).
Interestingly, if we absorb the factor \(W^{1/2}\) into \(V\) as in \(\tilde{V}_W = W^{1/2} V\) we can re-write the constraint in terms of \(\tilde{V}_W\) as \(V^{\mathsf{T}} V = \tilde{V}_W^{\mathsf{T}} W^{-1} \tilde{V}_W = I\). That is, the matrix \(\tilde{V}_W\) will be orthogonal with respect to the inner product \(\lang\cdot, \cdot\rang_W\).
The PCA weighting scheme will not necessarily make any significant difference with regard to the estimated or reconstructed vector \(\widehat{X} = \sum_{i = 1}^K \lang v_i, X \rang_W W^{-1} v_i\), particularly if both the covariance matrix of \(X\) and the weight matrix \(W\) are diagonal. The differences arise when there is significant correlation structure in \(X\), in which case the weighting scheme can substantively impact the correlation structure of the factors \(v_i\).
Remark: When reconstructing an estimate of an image (or other data) from a weighted PCA, one must be careful to carry out the linear projections with respect to the weighted inner product, otherwise the results will be nonsensical.
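A minimal sketch of the weighted variant, assuming a diagonal weight matrix \(W\) (the data and weights here are placeholders): the directions come from the eigenvectors of \(W^{1/2}\Sigma W^{1/2}\), and the projection is applied to \(W^{1/2}X\).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(5)
N, n, k = 1000, 8, 2
X = rng.standard_normal((N, n)) @ rng.standard_normal((n, n))
X = X - X.mean(axis=0)

Sigma = X.T @ X / N                    # (or estimate via an SVD of X, as above)
w = rng.uniform(0.5, 2.0, size=n)      # diagonal importance weights
W_half = np.diag(np.sqrt(w))

lam, V = eigh(W_half @ Sigma @ W_half) # ascending eigenvalues
V_k = V[:, -k:]                        # top-k weighted principal directions

Y = X @ W_half @ V_k                   # projected (weighted) data, N x k
```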
Rayleigh Quotients and Fisher’s Linear Discriminant
Principal component analysis is a ubiquitous technique for dimensionality reduction. However, it suffers a major drawback in that it is completely unsupervised, whereas data often come attached with class labels. For example, we may have a random variable \(X\) and, associated with \(X\), a random class label \(Y \in [C] = \{1, \ldots, C\}\) indicating that \(X\) belongs to one of \(C\) separate classes. These classes may represent a data source, a target label, or simply some discrete feature of the data.
For some typical concrete examples, ponder these possibilities:
- The data \(X\) are natural language embeddings and the labels \(Y\) are the language (English, French, Persian, …)
- Data \(X\) represent stock returns, labels \(Y\) indicate the stock’s industry
- \(X\) is real-valued features of a house, and the class label \(Y\) is a categorical variable such as the neighbourhood.
In PCA, the projection matrix \(V\) is constructed from analyzing \(X\) alone, disregarding the class label \(Y\). Ideally, we would use the information available in \(Y\) to construct \(V\), and Fisher’s Linear Discriminant (FLD) is one particular method to do so. The idea is to find the projection \(V\) such that samples \(X_i\) which share a class label \(Y_i\) are clustered close to each other, and at the same time that clusters formed by samples with different class labels are far apart.
Quantitatively, first suppose that we’ve centered our data so that \(\mu = \mathbb{E}[X] = 0\). Then, to establish more notation, let \(\mu_c = \mathbb{E}[X\ |\ Y = c]\) and \(\Sigma_c = \mathbb{E}[(X - \mu_c)(X - \mu_c)^{\mathsf{T}}\ |\ Y = c]\) be the class conditional mean and variance of \(X\), respectively. Finally, let \(\pi_c = \mathbb{P}(Y = c)\) be the probability that a sample belongs to class \(c\).
Now, just as we did for principle component analysis, let’s consider the variance of the projection \(V^{\mathsf{T}} X\). To bring out the class labels, we will condition on the value of \(Y\):
\begin{equation} \begin{aligned} \mathbf{Var}[V^\mathsf{T} X] &= V^{\mathsf{T}} \mathbb{E}(X - \mathbb{E}[X])(X - \mathbb{E}[X])^{\mathsf{T}} V\\ &= V^{\mathsf{T}} \mathbb{E}[XX^{\mathsf{T}}] V\\ &= V^{\mathsf{T}} \mathbb{E}\mathbb{E}[XX^{\mathsf{T}} \ |\ Y]V\\ &= V^{\mathsf{T}} \sum_c \pi_c \mathbb{E}[XX^{\mathsf{T}} \ |\ Y = c]V\\ &\overset{(a)}{=} V^{\mathsf{T}} \sum_c \pi_c \bigl(\mathbf{Var}[X\ |\ Y = c] + \mathbb{E}[X\ |\ Y = c]\mathbb{E}[X\ |\ Y = c]^{\mathsf{T}}\bigr) V\\ &= V^{\mathsf{T}} \sum_c \pi_c \bigl(\Sigma_c + \mu_c \mu_c^{\mathsf{T}} \bigr) V, \end{aligned}\notag \end{equation}
where \((a)\) rearranges the variance formula \(\mathbf{Var}(Z) = \mathbb{E}[ZZ^{\mathsf{T}}] - \mathbb{E}[Z]\mathbb{E}[Z]^{\mathsf{T}}\). The same result could be obtained by applying the Law of Total Variance , which is what I did first before seeing that the above calculation is easier.
In the context of FLD, the term \(\Sigma_w = \sum_c \pi_c \Sigma_c\) is a measure of the average within-class variance (i.e., the variance of the data conditioned on a class label) and \(\Sigma_b = \sum_c \pi_c \mu_c \mu_c^{\mathsf{T}}\) is the between-class variance (i.e., a measure of the separation between the classes).
The key idea of FLD is to minimize the within-class variance (of the projection) \(V^{\mathsf{T}} \Sigma_w V\), and maximize the between-class variance \(V^{\mathsf{T}} \Sigma_b V\). Projections which achieve this goal identify axes which simultaneously capture the similarities within classes and the differences between them.
To operationalize what it means to minimize the within-class variance, and maximize the between-class variance, we need two things: (1) we need to turn the matrix-valued variances into a scalar and (2) we need to define the tradeoff between maximizing and minimizing. There are surely many ways of doing this, but the choice made by FLD is to maximize the following scalar criterion:
\begin{equation} J(V) = \mathsf{tr} (V^{\mathsf{T}} \Sigma_w V)^{-1} (V^{\mathsf{T}} \Sigma_b V). \end{equation}
The trace serves to turn the objective into a scalar, and inverting the within-class variance means that maximizing \(J\) simultaneously drives the between-class variance up and the within-class variance down.
While the function \(J(V)\) obviously looks like it has a lot of interesting structure, it may not be immediately clear what to do with it or how to optimize it (certainly, it is not a convex function of \(V\)). However, we can recognize an analogy – suppose that \(V\) is just a vector, and that \(\Sigma_w = I\). In this case, \(J(v) = \frac{v^{\mathsf{T}} \Sigma_b v}{v^{\mathsf{T}} v}\), which is a Rayleigh Quotient for the matrix \(\Sigma_b\). The reader familiar with how to connect Rayleigh quotients to eigenvalues should be inspired – the Rayleigh Quotient is invariant to rescaling of the vector, so let’s consider a change of basis matrix \(C \in \mathbb{R}^{k \times k}\) (i.e., an invertible matrix) and check the objective:
\begin{equation} \begin{aligned} J(VC) &= \mathsf{tr}\ (C^{\mathsf{T}} V^{\mathsf{T}} \Sigma_w V C)^{-1} (C^{\mathsf{T}} V^{\mathsf{T}} \Sigma_b V C)\\ &= \mathsf{tr}\ C^{-1} (V^{\mathsf{T}} \Sigma_w V)^{-1} C^{-\mathsf{T}} C^{\mathsf{T}} (V^{\mathsf{T}} \Sigma_b V) C\\ &= \mathsf{tr}\ C^{-1} (V^{\mathsf{T}} \Sigma_w V)^{-1} (V^{\mathsf{T}} \Sigma_b V) C\\ &= \mathsf{tr}\ (V^{\mathsf{T}} \Sigma_w V)^{-1} (V^{\mathsf{T}} \Sigma_b V) \\ &= J(V), \end{aligned}\notag \end{equation}
where the last step uses the cyclic property of the trace. Therefore, the objective is invariant to a change of basis (just like ordinary Rayleigh Quotients are invariant to rescalings) so we might as well impose the convenient constraint that \(V^{\mathsf{T}} \Sigma_w V = I_k\), since if \(U\) is some maximizer of \(J\) which does not satisfy this condition, we can apply an invertible change of basis \(C\) so that \(UC\) does. Thus, we are left with the GEVP:
\begin{equation} \begin{aligned} \underset{V \in \mathbb{R}^{n \times k}}{\text{maximize}}&\ \mathsf{tr}(V^{\mathsf{T}} \Sigma_b V)\\ \text{subject to}&\ V^{\mathsf{T}} \Sigma_w V = I_k, \end{aligned}\notag \end{equation}
the solution of which is given by the generalized eigenvectors of the pair \((\Sigma_b, \Sigma_w)\) associated with the \(k\) largest generalized eigenvalues. This provides a matrix \(V \in \R^{n \times k}\) which can be used to reduce the dimensionality of the data \(X\) such that the different classes are well separated in the lower dimensional space.
Remark: Notice that optimizing the Rayleigh quotient provides another variational characterization of the generalized eigenvalues!
There is however an important caveat. Notice that \(\Sigma_b = \sum_c \pi_c \mu_c \mu_c^{\mathsf{T}}\) is a sum of \(C\) rank-1 matrices. This matrix therefore has rank at most \(C\) – in fact at most \(C - 1\), since the overall mean is zero – and then \(V^{\mathsf{T}} \Sigma_b V\) must also have rank at most \(C - 1\), though it is of dimension \(k \times k\) (where \(V\) is \(n \times k\) and \(k \le n\)). Thus, we need to be careful that \(k \le C - 1\) (smaller for degenerate cases): only that many generalized eigenvalues of \((\Sigma_b, \Sigma_w)\) can be nonzero, and any further directions carry no between-class separation – this places a restriction on the dimensionality of the space onto which we usefully project.
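Keeping this caveat about \(k\) in mind, here is a sketch of the whole recipe from labeled data: estimate \(\Sigma_w\) and \(\Sigma_b\) from the class-conditional moments, then hand the pair to scipy.linalg.eigh and keep the most discriminative directions. The data and the helper fld_directions are placeholders of my own, not code from the original post.

```python
import numpy as np
from scipy.linalg import eigh

def fld_directions(X, y, k):
    """Top-k discriminant directions for centered data X (N x n) and labels y."""
    n = X.shape[1]
    classes, counts = np.unique(y, return_counts=True)
    pis = counts / len(y)
    Sigma_w = np.zeros((n, n))
    Sigma_b = np.zeros((n, n))
    for c, pi in zip(classes, pis):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sigma_w += pi * np.cov(Xc.T, bias=True)
        Sigma_b += pi * np.outer(mu_c, mu_c)
    # k largest generalized eigenvalues of (Sigma_b, Sigma_w); eigh sorts ascending
    _, V = eigh(Sigma_b, Sigma_w, subset_by_index=[n - k, n - 1])
    return V[:, ::-1]                  # most discriminative direction first

rng = np.random.default_rng(6)
means = np.array([[-2.0, 0.0, 0.0, 0.0, 0.0],
                  [ 2.0, 0.0, 0.0, 0.0, 0.0],
                  [ 0.0, 3.0, 0.0, 0.0, 0.0]])
X = np.vstack([rng.normal(loc=m, size=(100, 5)) for m in means])
y = np.repeat([0, 1, 2], 100)
X = X - X.mean(axis=0)                 # center first, as assumed in the text
V = fld_directions(X, y, k=2)          # note k <= C - 1 = 2 here
Z = X @ V                              # projected, class-separated data
```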
Example
One of the most natural datasets on which to apply this technique is classic MNIST-style handwritten digits data, for which I’ve just used the small digits dataset shipped with scikit-learn . This dataset consists of \(1797\) examples of \(8 \times 8\) images of written digits between \(0\) and \(9\). Pixel intensities are between \(0\) and \(16\).
Simply flattening out the \(8 \times 8\) images into a vector gives us a data matrix \(X \in \R^{1797 \times 64}\), which I’ll project into a lower dimensional subspace using either FLD or PCA – the labels \(y \in \{0, \ldots, 9\}^{1797}\) are natural class labels for FLD.
Calculating the projections onto a \(K \ll 64\) dimensional subspace (after centering and standardizing the \(X\) matrix) results in two new datasets \(\widehat{X}_{\text{pca}} \in \R^{1797 \times K}\) and \(\widehat{X}_{\text{fld}} \in \R^{1797 \times K}\), the former being the projection onto the axes which maximizes the variance, and the latter being constructed to try to separate the clumps of classes.
In order to make a nice visual comparison between these two projections, I’d like to try to “line up” the projected data as well as possible so that if the data were to fall into similar clumps for PCA and FLD, that these clumps would appear in similar locations in a visualization. To do so, I’ve rotated the matrix \(\widehat{X}_{\text{fld}}\) with an orthogonal matrix \(Q\) obtained from solving an Orthogonal Procrustes Problem .
Orthogonal Procrustes Problem
This is a bonus problem which is similar in spirit to a trace optimization problem, but takes on a different form and has a different solution than the problems I’ve considered so far. Specifically, we solve
\begin{equation} \begin{aligned} \underset{Q \in \R^{K \times K}}{\text{minimize}} &\quad \frac{1}{2}||Q\widehat{X}_{\text{fld}} - \widehat{X}_{\text{pca}}||_F^2\\ \text{subject to} &\quad Q^\mathsf{T} Q = I, \end{aligned}\notag \end{equation}
so that \(\widehat{X}_{\text{fld}}\) is rotated (and possibly reflected) to better line up with \(\widehat{X}_{\text{pca}}\). (For the dimensions to work out, I treat the projected data in this subsection as \(K \times 1797\) matrices with samples in the columns, so that \(Q\) acts on the \(K\) projection coordinates.) We can make this look like a trace optimization problem by expanding the objective as in
\begin{equation} \begin{aligned} \frac{1}{2}||Q\widehat{X}_{\text{fld}} - \widehat{X}_{\text{pca}}||_F^2 &= \frac{1}{2}\mathsf{tr}(\widehat{X}_{\text{fld}}^{\mathsf{T}}Q^{\mathsf{T}}Q\widehat{X}_{\text{fld}} + \widehat{X}_{\text{pca}}^{\mathsf{T}}\widehat{X}_{\text{pca}}) - \mathsf{tr}(Q^{\mathsf{T}} \widehat{X}_{\text{pca}}\widehat{X}_{\text{fld}}^{\mathsf{T}})\\ &\overset{(a)}{=} \frac{1}{2}\mathsf{tr}(\widehat{X}_{\text{fld}}^{\mathsf{T}}\widehat{X}_{\text{fld}} + \widehat{X}_{\text{pca}}^{\mathsf{T}}\widehat{X}_{\text{pca}}) - \mathsf{tr}(Q^{\mathsf{T}} \widehat{X}_{\text{pca}}\widehat{X}_{\text{fld}}^{\mathsf{T}}),\\ \end{aligned}\notag \end{equation}
where \((a)\) uses the orthogonality constraint. This leaves us with a sort of “one-sided” trace optimization problem, distinct from the problems encountered so far:
\begin{equation} \begin{aligned} \underset{Q \in \R^{K \times K}}{\text{maximize}} &\quad \mathsf{tr}(Q^{\mathsf{T}} \widehat{X}_{\text{pca}}\widehat{X}_{\text{fld}}^{\mathsf{T}})\\ \text{subject to} &\quad Q^\mathsf{T} Q = I. \end{aligned}\notag \end{equation}
To actually see the solution of this problem, expand out the related term
\begin{equation} \frac{1}{2}||Q - \widehat{X}_{\text{pca} }\widehat{X}_{\text{fld}}^{\mathsf{T}}||_F^2 = \frac{1}{2}\mathsf{tr}(Q^{\mathsf{T}}Q + \widehat{X}_{\text{fld}}\widehat{X}_{\text{pca}}^{\mathsf{T}}\widehat{X}_{\text{pca}} \widehat{X}_{\text{fld}}^{\mathsf{T}}) - \mathsf{tr}(Q^{\mathsf{T}} \widehat{X}_{\text{pca}}\widehat{X}_{\text{fld}}^{\mathsf{T}}),\notag \end{equation}
which similarly involves constants plus the above trace objective. We can therefore equivalently solve \[\underset{Q: Q^{\mathsf{T}}Q = I}{\text{minimize}}\ \frac{1}{2}||Q - \widehat{X}_{\text{pca} }\widehat{X}_{\text{fld}}^{\mathsf{T}}||_F^2.\] Applying a Singular Value Decomposition \(\widehat{X}_{\text{pca} }\widehat{X}_{\text{fld}}^{\mathsf{T}} = U\Sigma V^{\mathsf{T}}\) to the product, and using the fact that \(U, V\) are orthogonal – so we can freely multiply by \(U, V\) inside the norm without changing its value – leads to the solution \(Q = UV^{\mathsf{T}}\), with cost \(\frac{1}{2}\sum_{i = 1}^m (1 - \sigma_i)^2\) for singular values \(\sigma_i\) of \(\widehat{X}_{\text{pca} }\widehat{X}_{\text{fld}}^{\mathsf{T}}\). Precisely,
\begin{equation} \begin{aligned} \underset{Q: Q^{\mathsf{T}}Q = I}{\text{min}}\ \frac{1}{2}||Q - \widehat{X}_{\text{pca}}\widehat{X}_{\text{fld}}^{\mathsf{T}}||_F^2 &= \underset{Q: Q^{\mathsf{T}}Q = I}{\text{min}}\ \frac{1}{2}||Q - U\Sigma V^{\mathsf{T}}||_F^2\\ &\overset{(a)}{=} \underset{D: D^{\mathsf{T}}D = I}{\text{min}}\ \frac{1}{2}||UDV^{\mathsf{T}} - U\Sigma V^{\mathsf{T}}||_F^2\\ &\overset{(b)}{=} \underset{D: D^{\mathsf{T}}D = I}{\text{min}}\ \frac{1}{2}||D - \Sigma||_F^2\\ &\overset{( c)}{=} \frac{1}{2}\sum_{i = 1}^m (1 - \sigma_i)^2, \end{aligned}\notag \end{equation}
where \((a)\) expresses \(Q\) in the same basis (writing \(D = U^{\mathsf{T}} Q V\), which is orthogonal but not yet necessarily diagonal), and \((b)\) uses the fact that multiplication by the orthogonal matrices \(U, V\) leaves the Frobenius norm unchanged. Since \(\mathsf{tr}(D^{\mathsf{T}}D) = m\) for any orthogonal \(D\), minimizing \(||D - \Sigma||_F^2\) amounts to maximizing \(\mathsf{tr}(D^{\mathsf{T}}\Sigma) = \sum_i \sigma_i D_{ii}\), which (because \(\sigma_i \ge 0\) and \(|D_{ii}| \le 1\)) is achieved by \(D = I\), giving equality \(( c)\). \(\ \square\)
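For completeness, the closed form \(Q = UV^{\mathsf{T}}\) is also what scipy.linalg.orthogonal_procrustes computes; here is a sketch checking the two against each other on placeholder matrices (arranged, as above, with the \(K\) coordinates along the rows).

```python
import numpy as np
from scipy.linalg import svd, orthogonal_procrustes

rng = np.random.default_rng(7)
K, N = 2, 200
X_fld = rng.standard_normal((K, N))    # projected data, coordinates in rows
X_pca = rng.standard_normal((K, N))

# closed form: Q = U V^T from the SVD of X_pca X_fld^T
U, s, Vt = svd(X_pca @ X_fld.T)
Q = U @ Vt

# scipy solves min ||A R - B||_F over orthogonal R, so transpose to match
R, _ = orthogonal_procrustes(X_fld.T, X_pca.T)
print(np.allclose(Q, R.T))
print(np.allclose(Q.T @ Q, np.eye(K)))
```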
I’ve used this matrix \(Q\) to better align the data in the animation below. Notice that each coloured clump (corresponding to different digits) is roughly aligned in space – there is nothing intrinsic about PCA and FLD that would encourage this alignment.
In addition to the figure, I also fit a simple sklearn.svm.LinearSVC to the projected data in order to verify that the clusters are in fact well informed by the class labels. The MCC value is annotated in the title of each figure.
Optimizing Signal to Noise Ratio
As a final example, let’s examine a problem in signal processing. Let \(X \in \R^{n}\) be an unobserved zero mean random vector with variance \(\Sigma\) and \(E \in \R^{n}\) be another zero-mean and unobserved random vector with variance \(\Omega\). The observed quantity is given by \(Y = X + E\), i.e., a noisy measurement of the signal of interest.
Similarly to PCA, we will find a matrix \(V \in \R^{n \times k}\) to project \(Y\) onto some lower dimensional space, but this time we will optimize the signal-to-noise ratio (SNR), where the “signal” is \(V^{\mathsf{T}} X\) (containing the information about the vector \(X\) that we care about) and the “noise” is \(V^{\mathsf{T}} E\) (containing only the corruption \(E\)). This leads to the optimization problem:
\begin{equation} \begin{aligned} \underset{V \in \R^{n \times k}}{\text{maximize}}&\ \frac{\mathbb{E}||V^{\mathsf{T}} X||_2^2}{\mathbb{E}||V^{\mathsf{T}} E||_2^2}\\ \text{subject to}&\ V^{\mathsf{T}} V = I. \end{aligned}\tag{$\mathcal{P}$} \end{equation}
This is a generalization of PCA since if \(E\) is isotropic, \(\mathbb{E}||V^{\mathsf{T}} E||_2^2\) is a constant given the constraint, and can be dropped. Moreover, this problem is different from the GEVPs encountered earlier, because we enforce the constraint that \(V^{\mathsf{T}}V = I\), as opposed to \(V^{\mathsf{T}}\Omega V = I\).
While this problem doesn’t immediately take the form of an eigenvalue problem, we can relate it to one with some clever manipulations (essentially drawn from Kokiopoulou et al. ). Supposing that \(s^\star > 0\) is the optimal SNR, it must be the case that \[\frac{\mathbb{E}||V^{\mathsf{T}} X||_2^2}{\mathbb{E}||V^{\mathsf{T}} E||_2^2} \le s^\star\] for every matrix $V$ with orthonormal columns. Therefore, denoting by \(\mathcal{O}_k\) the set of \(n \times k\) matrices with orthonormal columns, we have
\begin{equation} \begin{aligned} \forall V \in \mathcal{O}_k:\ \mathbb{E}||V^{\mathsf{T}} X||_2^2 - s^\star \mathbb{E}||V^{\mathsf{T}} E||_2^2 &\le 0\\ \implies \forall V \in \mathcal{O}_k:\ \mathbb{E} \mathsf{tr}\ V^{\mathsf{T}} XX^{\mathsf{T}} V - s^\star \mathbb{E} \mathsf{tr}\ V^{\mathsf{T}} EE^{\mathsf{T}} V & \le 0\\ \implies \forall V \in \mathcal{O}_k:\ \mathsf{tr}\ V^{\mathsf{T}} \Sigma V - s^\star \mathsf{tr}\ V^{\mathsf{T}} \Omega V &\le 0\\ \implies \forall V \in \mathcal{O}_k:\ \mathsf{tr}\ V^{\mathsf{T}} \bigl(\Sigma - s^\star \Omega\bigr) V &\le 0\\ \implies \underset{V \in \mathcal{O}_k}{\text{max}}\ \mathsf{tr}\ V^{\mathsf{T}} \bigl(\Sigma - s^\star \Omega\bigr) V &\le 0.\\ \end{aligned} \end{equation}
From what we know of eigenvalue problems, the maximizer \(V\) in \(\mathcal{O}_k\) of \(\mathsf{tr}\ V^{\mathsf{T}} \bigl(\Sigma - s \Omega \bigr) V\) is given by the last \(k\) eigenvectors of the matrix \(\Sigma - s \Omega\) (those associated with the \(k\) largest eigenvalues). Let’s call these \(V_s\). We want to maximize the SNR, and thus find the largest \(s\) which satisfies the above inequalities – i.e., the optimal value \(s^\star\) should be a root of the function \[\mathcal{S}(s) = \mathsf{tr}\ V^{\mathsf{T}}_s \bigl( \Sigma - s \Omega \bigr) V_s,\] a function which can be evaluated through an eigendecomposition of \(\Sigma - s \Omega\). (Remark: expressions like $\Sigma - s\Omega$ involving a matrix minus a scalar times another matrix are often referred to as matrix pencils. I am told they are named after the concept in geometry .)
To check that this can actually occur, notice that since \(\Omega \succ 0\), the function \(\mathcal{S}\) is strictly decreasing on \(\R_+\), and \(\mathcal{S}(0) > 0\) (since \(\Sigma \succ 0\)). Since \(\mathcal{S}\) is continuous and strictly decreasing, it only remains to show that it eventually becomes negative, which also hands us a compact interval known to contain \(s^\star\), as required for bisection search. I show in the detail below that \(\mathcal{S}(s) < 0\) for any \(s > \lambda_n(\Sigma, \Omega)\), the largest generalized eigenvalue.
Generalized Eigenvalue Decomposition
Supposing that \(\Lambda = \mathsf{Dg}(\lambda_1, \ldots, \lambda_n)\) is a diagonal matrix of generalized eigenvalues and \(V \in \mathbb{R}^{n \times n}\) contains the generalized eigenvectors in the columns, we can write the GEVP as \[\Sigma V = \Omega V \Lambda.\] Then, since \(V^{\mathsf{T}} \Omega V = I\) we have \(VV^{\mathsf{T}} \Omega V = V\) which, since \(V\) is invertible (we know it is a basis for \(\R^n\)) implies that \(VV^{\mathsf{T}} \Omega = I\). Multiplying the GEVP on the right by \(V^{\mathsf{T}} \Omega\) then provides to us a generalized eigenvalue decomposition of \(\Sigma\) with respect to \(\Omega\): \[\Sigma = \Omega V \Lambda V^{\mathsf{T}} \Omega.\]
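This identity is easy to verify numerically; the matrices below are arbitrary positive definite examples.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(8)
n = 5
M = rng.standard_normal((n, n))
Sigma = M @ M.T + n * np.eye(n)
M2 = rng.standard_normal((n, n))
Omega = M2 @ M2.T + n * np.eye(n)

lam, V = eigh(Sigma, Omega)            # generalized eigenpairs of (Sigma, Omega)
print(np.allclose(V.T @ Omega @ V, np.eye(n)))    # V^T Omega V = I
print(np.allclose(Sigma, Omega @ V @ np.diag(lam) @ V.T @ Omega))
```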
We can use these facts to show that \(\Sigma - s\Omega\) is negative definite whenever \(s > \lambda_n\):
\begin{equation} \begin{aligned} \Sigma - s\Omega &= \Omega V \Lambda V^{\mathsf{T}} \Omega - s\Omega\\ &= \Omega V \Lambda V^{\mathsf{T}} \Omega - s\Omega VV^{\mathsf{T}} \Omega\\ &= \Omega V (\Lambda - sI) V^{\mathsf{T}} \Omega, \end{aligned}\notag \end{equation}
and since \(\Omega V\) is invertible and \(\Lambda - sI \prec 0\) for any \(s > \lambda_n\), the congruence \(\Omega V (\Lambda - sI) V^{\mathsf{T}} \Omega\) is negative definite. Hence \(\mathsf{tr}\ U^{\mathsf{T}}(\Sigma - s\Omega)U < 0\) for every \(U \in \mathcal{O}_k\), and in particular \(\mathcal{S}(s) < 0\) for any \(s > \lambda_n\). \(\ \square\)
Now, it follows that \(s^\star\) is the unique root of the monotone function \(\mathcal{S}\) and it lies in the interval \([0, \lambda_n(\Sigma, \Omega)]\). Thus, we can find \(s^\star\) by performing bisection on \(\mathcal{S}\), and then recover the optimal matrix \(V_{s^\star}\) by computing the last \(k\) eigenvectors of \(\Sigma - s^\star \Omega\). The value \(s^\star\), remember, is also the value of the optimal SNR.
For the Reader: I’ve glossed over another property that should be verified for $\mathcal{S}$ before making these conclusions. What is it and why does it hold?
An example of this technique is given for the same faces dataset as the PCA example, where I’ve corrupted the images with noise that is locally correlated in space. Clearly the result is not particularly good, in an absolute sense, but the PCA reconstruction is completely corrupted by the anisotropic noise distribution.
A sketch of code implementing this technique is below.
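Here is a rough Python sketch of the bisection procedure, assuming \(\Sigma \succ 0\) and \(\Omega \succ 0\) are given as NumPy arrays; the helper names are mine, not from the original post.

```python
import numpy as np
from scipy.linalg import eigh

def S(Sigma, Omega, s, k):
    """S(s): sum of the k largest eigenvalues of Sigma - s * Omega."""
    n = Sigma.shape[0]
    lam = eigh(Sigma - s * Omega, eigvals_only=True, subset_by_index=[n - k, n - 1])
    return lam.sum()

def max_snr_projection(Sigma, Omega, k, tol=1e-10):
    """Bisection for the optimal SNR s* and the corresponding projection V."""
    n = Sigma.shape[0]
    lo = 0.0
    hi = eigh(Sigma, Omega, eigvals_only=True)[-1]   # largest generalized eigenvalue
    while hi - lo > tol * max(hi, 1.0):
        mid = 0.5 * (lo + hi)
        if S(Sigma, Omega, mid, k) > 0:
            lo = mid                                 # S(mid) > 0: root lies to the right
        else:
            hi = mid
    s_star = 0.5 * (lo + hi)
    _, V = eigh(Sigma - s_star * Omega, subset_by_index=[n - k, n - 1])
    return s_star, V

# example usage on arbitrary positive definite covariances
rng = np.random.default_rng(9)
M = rng.standard_normal((6, 6))
Sigma = M @ M.T + 6 * np.eye(6)
M2 = rng.standard_normal((6, 6))
Omega = M2 @ M2.T + 6 * np.eye(6)
s_star, V = max_snr_projection(Sigma, Omega, k=2)
```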
Conclusion
This post has examined the theory and applications of eigenvalue problems and generalized eigenvalue problems. We’ve seen through the Spectral Theorem that the symmetry of the matrix is essential for obtaining real-valued eigenvalue decompositions (and thus an ordering of the eigenvalues). These eigenvalues can be characterized by variational problems, and useful applied problems (particularly PCA and its weighted version, and FLD) can be formulated as trace optimization problems of the same form.
The key implication of this is that these optimization problems admit efficient solutions via eigenvalue decompositions. They are by no means convex, yet they can be solved reliably through a call to scipy.linalg.eigh.
We’ve also seen some “bonus” relationships with the singular value decomposition in the form of the Orthogonal Procrustes Problem, and the fact that you can compute the eigenvalues needed for PCA directly from \(X\), rather than having to form \(\widehat{\Sigma} = \frac{1}{N}X^{\mathsf{T}}X\). As well, the detailed computations for signal to noise ratio optimization show us the generalized eigenvalue decomposition \(\Sigma = \Omega V \Lambda V^{\mathsf{T}}\Omega\) of \(\Sigma\) with respect to \(\Omega\). This is another useful decomposition which generalizes the “vanilla” eigenvalue decomposition \(\Sigma = U \Lambda U^{\mathsf{T}}\).
These problems present plenty of further interesting avenues for exploration, which I invite the reader to meditate upon :)