All About Quadratic Forms

Quadratic forms are functions defined through symmetric matrices and represent a ubiquitous class of functions for which there is an enormous body of useful theoretical and computational results. Indeed, in “linear-quadratic” models, it is possible to provide analytic answers to more-or-less any question you would like to ask. This post is a tour of some foundations and results relating to quadratic forms.

Introduction

Quadratic forms are functions \(f: \R^n \rightarrow \mathbb{R}\) defined by a constant \(\kappa \in \mathbb{R}\), a vector \(q \in \mathbb{R}^n\) and a matrix \(Q \in \mathbb{S}^n\) (\(\mathbb{S}^n\) denoting the set of symmetric \(n \times n\) matrices), namely, \(f(x) = \frac{1}{2}x^{\mathsf{T}} Q x + q^{\mathsf{T}} x + \frac{1}{2}\kappa\). That the matrix \(Q\) is assumed symmetric is without loss of generality since if \(Q\) were not symmetric, we could instead consider \(\frac{1}{2}Q + \frac{1}{2}Q^{\mathsf{T}}\) without affecting the value of the function. The \(\frac{1}{2}\)’s in the definition of \(f\) will be seen to be a convenient convention.

Firstly, we can expand this out explicitly as \[f(x) = \frac{1}{2}\sum_{i = 1}^n \sum_{j = 1}^n Q_{ij}x_i x_j + \sum_{i = 1}^n q_i x_i + \frac{1}{2}\kappa.\] Thus, the function \(f\) is a linear combination of all pairwise products of the components of \(x\), together with a linear and a constant term, but no higher-order powers. Another convenient way to write quadratic forms (QFs) is through a single matrix, augmenting \(x\) with a \(1\)

\begin{equation} f(x) = \frac{1}{2}\begin{bmatrix}x\\ 1\\ \end{bmatrix}^{\mathsf{T}}\begin{bmatrix}Q & q\\ q^{\mathsf{T}} & \kappa \end{bmatrix}\begin{bmatrix}x\\ 1 \end{bmatrix}, \notag \end{equation}

so for the most part, techniques that work for the simpler case with \(q = 0\) and \(\kappa = 0\), also apply to the more general case.
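
As a quick numerical sanity check of the augmented representation, here is a minimal numpy sketch (the variable names are my own, chosen purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 4
Q = rng.normal(size=(n, n))
Q = 0.5 * (Q + Q.T)                          # symmetrize, as in the definition
q, kappa, x = rng.normal(size=n), rng.normal(), rng.normal(size=n)

f_direct = 0.5 * x @ Q @ x + q @ x + 0.5 * kappa

# Augmented representation: stack x with a trailing 1.
M = np.block([[Q, q[:, None]], [q[None, :], np.array([[kappa]])]])
z = np.concatenate([x, [1.0]])
f_augmented = 0.5 * z @ M @ z

assert np.isclose(f_direct, f_augmented)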

Assumption: For the purposes of this post, I will usually assume that $Q$ is invertible. This assumption is not essential, but if this does not hold, then the quadratic form is degenerate and breaks down into a lower-dimensional full-rank quadratic form (restricted to the span of the columns of $Q$) and a linear form restricted to the perpendicular complement thereof. It is analogous to the assumption that $a \ne 0$ in the classical quadratic polynomial $ax^2 + bx + c$.

From Whence?

Where do quadratic forms come from? At the most superficial level, they are nothing but a natural multivariate generalization of the quadratic polynomial \(ax^2 + bx + c\). However, they also arise in extremely natural ways in numerous applications. They are truly ubiquitous.

Least Squares: The least-squares cost function \(f(x) = \frac{1}{2}||y - Ax||_2^2\) is a quadratic form \(f(x) = \frac{1}{2}x^{\mathsf{T}} A^{\mathsf{T}} A x - (A^{\mathsf{T}} y)^{\mathsf{T}} x + \frac{1}{2}||y||_2^2\) where \(Q = A^{\mathsf{T}} A\), \(q = -A^{\mathsf{T}} y\) and \(\kappa = ||y||_2^2\). The least-squares cost function arises in statistical regression, estimation, as a design objective, etc.

Variance of a Sum: If we have some vector random variable \(Z\) with variance matrix \(\Sigma \in \mathbb{S}^n\), then the variance of the linear combination \(\text{Var}[Z^{\mathsf{T}} x] = x^{\mathsf{T}} \Sigma x\) is a quadratic form when viewed as a function of \(x\).

Local Approximation: Generally, (twice-)differentiable functions \(f(x)\) can be expanded into a Taylor series about some point \(x_0\) as in \(f(x) = f(x_0) + \mathsf{D}f(x_0) (x - x_0) + \frac{1}{2} (x - x_0)^{\mathsf{T}} \mathsf{D}^2 f(x_0)(x - x_0) + R(x)\) where the remainder term \(R(x)\) is of third order in the components of \((x - x_0)\). Truncating the \(R(x)\) term results in a quadratic form that gives a local approximation of the function \(f\) near the point \(x_0\). These approximations are used in (second-order) iterative optimization algorithms – particularly Newton’s method, sequential least squares, and interior point algorithms.

Gaussian Distributions: The Gaussian distribution is one of the most important distributions in probability. It is characterized by two parameters \(\mu \in \mathbb{R}^n\) (the mean) and \(\Sigma \in \mathbb{S}_{+}^n\) (the positive semidefinite variance matrix). When \(\Sigma\) is positive definite (hence invertible), Gaussian random variables have the density function \(f(x) \propto \text{exp}\bigl(-\frac{1}{2}(x - \mu)^{\mathsf{T}} \Sigma^{-1} (x - \mu) \bigr)\), which involves a quadratic form. Moreover, the exponent of the characteristic function \(\psi(t) = \text{exp}\bigl(j\mu^{\mathsf{T}} t - \frac{1}{2}t^{\mathsf{T}} \Sigma t\bigr)\) is also a quadratic form in \(t\). (Remark: The definition of the Gaussian should most properly be given in terms of \(\psi\), instead of the density, since the expression for \(\psi\) does not require that \(\Sigma\) be invertible!)

Use in Optimization: A huge number of general optimization problems, each of which has its own space of applications, involve quadratic forms in the objective function. Examples are given by Quadratic Programming (including Support Vector Machines), the Quadratic Assignment Problem, the Linear Quadratic Regulator, and many more.

Completing the Square

For the classical quadratic polynomial \(ax^2 + bx + c\), to “complete the square” is an elementary algebraic operation wherein we eliminate the linear term \(bx\) by writing \(ax^2 + bx + c = a(x + b / (2a))^2 + c - b^2 / (4a)\). This procedure enables one to derive the quadratic formula, identify minimizers and maximizers, etc. There is an analogous procedure for general quadratic forms:

\begin{equation} \begin{aligned} f(x) &= \frac{1}{2}x^{\mathsf{T}} Q x + q^{\mathsf{T}} x + \frac{1}{2}\kappa\\ &= \frac{1}{2}(x + Q^{-1} q)^{\mathsf{T}} Q (x + Q^{-1} q) + \frac{1}{2}\kappa - \frac{1}{2}q^{\mathsf{T}} Q^{-1} q, \end{aligned}\notag \end{equation}

where the \(Q^{-1}\) can be replaced by a pseudo-inverse \(Q^\dagger\) if \(Q\) is not invertible, provided that \(q \in \mathcal{R}(Q)\). Completing the square is a useful technique for derivations involving quadratic forms – in particular, it will help us to recognize minimizers of \(f\), and it is commonly used for manipulating expressions relating to Gaussian distributions. I solve a few optimization problems in this post, and nowhere do I resort to differentiation.
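
As a small sketch (not part of the derivation itself; the construction is mine), we can verify the completed-square identity numerically for a random invertible \(Q\):

import numpy as np

rng = np.random.default_rng(1)
n = 5
Q = rng.normal(size=(n, n))
Q = 0.5 * (Q + Q.T) + n * np.eye(n)          # symmetric and comfortably invertible
q, kappa, x = rng.normal(size=n), rng.normal(), rng.normal(size=n)

lhs = 0.5 * x @ Q @ x + q @ x + 0.5 * kappa

Qinv_q = np.linalg.solve(Q, q)               # Q^{-1} q, without forming the inverse
rhs = 0.5 * (x + Qinv_q) @ Q @ (x + Qinv_q) + 0.5 * kappa - 0.5 * q @ Qinv_q

assert np.isclose(lhs, rhs)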

Optimization

Quadratic forms are extremely important in optimization. Since they are so tractable, many algorithms for general optimization problems work by solving a sequence of quadratic subproblems, iteratively refining the approximation as they go.

Convexity

There is a class of quadratic forms which is of particular interest in optimization – convex quadratic forms. We will see later that, more generally, quadratic forms can be decomposed into convex, concave, and linear subspaces.

Recall that a function \(f\) is convex if \[\forall t \in (0, 1), x, y \in \mathsf{dom}\ f: f(tx + (1 - t)y) \le tf(x) + (1 - t)f(y).\] This means that the line segment joining two points on the graph of \(f\) lies on or above the graph itself. Intuitively, convex functions “look like bowls” and their importance stems, at least in part, from the fact that any local minimizer is also a global minimizer. Thus, for example, gradient descent applied to a convex function is guaranteed to converge to a global minimizer (under suitable step-size conditions, and provided a minimizer exists). A function is concave if its negative is convex – these functions look like upside-down bowls, and gradient ascent will converge to a global maximizer.

How do we determine when a quadratic form is convex? Without loss of generality, we consider simply the function \(f(x) = \frac{1}{2}x^{\mathsf{T}} Q x\) and calculate (notice that we need not resort to computing second order derivatives!):

\begin{equation} \begin{aligned} &\ f(tx + (1 - t)y) - tf(x) - (1 - t)f(y)\\ &= \frac{1}{2}t^2 x^{\mathsf{T}} Q x + t(1 - t) x^{\mathsf{T}} Q y + \frac{1}{2}(1 - t)^2 y^{\mathsf{T}} Q y - \frac{1}{2}tx^{\mathsf{T}}Q x - \frac{1}{2}(1 - t) y^{\mathsf{T}} Q y\\ &= \frac{1}{2}(t^2 - t) x^{\mathsf{T}} Q x + t(1 - t) x^{\mathsf{T}} Q y + \frac{1}{2}\bigl((1 - t)^2 - (1 - t)\bigr)y^{\mathsf{T}} Q y\\ &= -\frac{1}{2}t(1 - t) x^{\mathsf{T}} Q x + t(1 - t) x^{\mathsf{T}} Q y - \frac{1}{2}t (1 - t) y^{\mathsf{T}} Q y \\ &= -\frac{1}{2}t(1 - t) \bigl[x^{\mathsf{T}} Q x - 2x^{\mathsf{T}} Q y + y^{\mathsf{T}} Q y\bigr] \\ &= -\frac{1}{2}t(1 - t) (x - y)^{\mathsf{T}} Q (x - y)\\ &\overset{(a)}{\le} 0, \end{aligned} \notag \end{equation}

where if the final inequality \((a)\) holds over all \(x, y, t\), then we will have verified the definition of convexity for the function \(f\). Since \(t \in (0, 1)\) it must be that \(t(1 - t) > 0\), and because \(x, y\) appear only through the difference \((x - y)\), we can consider simply the condition \(\forall z:\ z^{\mathsf{T}} Q z \ge 0\). This is exactly what it means for the matrix \(Q\) to be positive semi-definite, denoted \(Q \succeq 0\).

Thus (also using the fact that a convex function plus the linear function \(x \mapsto q^{\mathsf{T}} x\) is still convex), a quadratic form \(f\) is convex if and only if \(Q \succeq 0\). If the matrix \(Q\) is both non-singular and positive semi-definite, then it is positive definite, written \(Q \succ 0\); in this case the inequality in the definition of convexity holds strictly whenever \(x \ne y\), and we call the function strictly convex.

Remark: It is worth noting here that every quadratic form associated with a Gaussian random variable is a convex quadratic form (since variance matrices are positive semidefinite). Moreover, smooth functions are locally convex near a non-degenerate minimizer (one where the Hessian is positive definite), and the local quadratic approximations of convex functions are themselves always convex. As well, the quadratic form $\frac{1}{2}||Ax||_2^2 = \frac{1}{2}x^{\mathsf{T}} A^{\mathsf{T}} A x$ is always convex since $A^{\mathsf{T}} A \succeq 0$. Convex quadratic forms arise almost as often as quadratic forms themselves!
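
To make the equivalence concrete, here is a minimal numerical sketch (my own construction, nothing canonical): positive semi-definiteness of \(Q\), checked via its eigenvalues, should agree with a brute-force search for violations of the convexity inequality.

import numpy as np

rng = np.random.default_rng(2)
n = 3

def is_psd(Q, tol=1e-10):
    # Convex as a quadratic form iff all eigenvalues of (symmetric) Q are >= 0.
    return np.linalg.eigvalsh(Q).min() >= -tol

def violates_convexity(Q, trials=2000):
    # Randomly search for a counterexample to f(tx + (1-t)y) <= t f(x) + (1-t) f(y).
    f = lambda x: 0.5 * x @ Q @ x
    for _ in range(trials):
        x, y, t = rng.normal(size=n), rng.normal(size=n), rng.uniform()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
            return True
    return False

A = rng.normal(size=(n, n))
Q_convex = A @ A.T                           # positive semi-definite by construction
Q_indefinite = np.diag([1.0, -1.0, 0.5])     # has a negative eigenvalue

assert is_psd(Q_convex) and not violates_convexity(Q_convex)
assert not is_psd(Q_indefinite) and violates_convexity(Q_indefinite)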

Minimizing Quadratic Forms

From the representation of the quadratic form \(f\) after completing the square \[f(x) = \frac{1}{2}(x + Q^{-1} q)^{\mathsf{T}} Q (x + Q^{-1} q) + \frac{1}{2}\kappa - \frac{1}{2}q^{\mathsf{T}} Q^{-1} q,\] it appears rather immediate to identify a minimizer: \[x^\star = - Q^{-1} q.\] However, we must be careful – in making this claim we have already assumed that a minimizer exists, and the existence of minimizers should not be taken lightly. If the matrix \(Q\) is indefinite (i.e., neither \(Q\) nor \(-Q\) is positive semi-definite), then the point \(-Q^{-1} q\) is only a saddle point and in fact no minimizer exists: \(\underset{x \in \mathbb{R}^n}{\text{inf}}\ f(x) = -\infty\). The question of existence in infinite dimensions is incredibly interesting, but it is straightforward in the present finite-dimensional case: we need only eliminate the possibility that the infimum is \(-\infty\) via the strict convexity assumption \(Q \succ 0\) (more generally, we can have \(Q \succeq 0\) and \(q \in \mathcal{R}(Q)\)). If \(Q \prec 0\), then \(x^\star\) will be a maximizer.

The actual value of the function \(f\) at the minimizer can also be easily identified by inspection after completing the square:

\[f^\star := f(x^\star) = \frac{1}{2}\kappa - \frac{1}{2}q^{\mathsf{T}} Q^{-1} q.\]

It is an interesting observation that this is a concave quadratic form in \(q\).
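
A tiny sketch of these formulas in numpy (again, my own illustration): for \(Q \succ 0\) the candidate \(x^\star = -Q^{-1}q\) is stationary and achieves the value \(f^\star\).

import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))
Q = A @ A.T + np.eye(n)                      # Q > 0, so a unique minimizer exists
q, kappa = rng.normal(size=n), rng.normal()

f = lambda x: 0.5 * x @ Q @ x + q @ x + 0.5 * kappa

x_star = -np.linalg.solve(Q, q)              # x* = -Q^{-1} q
f_star = 0.5 * kappa - 0.5 * q @ np.linalg.solve(Q, q)

assert np.allclose(Q @ x_star + q, 0)        # stationarity
assert np.isclose(f(x_star), f_star)
assert f(x_star + rng.normal(size=n)) >= f_star   # perturbations can only do worse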

The Shape of a Quadratic Form

We can get some intuition about convex and non-convex quadratic forms from the contour plots below. In the convex case, the function is a prototypical “bowl-shaped” function with a clearly defined minimizer, and the contour lines are concentric ellipses. In the non-convex case, the function is hyperbolic and the infimum is not attained – minimizing sequences diverge. In this hyperbolic case, the point \(-Q^{-1} q\) is a saddle point.

Figure 1: Comparing Convex and Non-Convex Quadratic Forms

The prototypical hyperbolic function is \(f(x, y) = xy\). However, it is clearly possible for functions containing \(xy\) terms to still be convex, e.g., \(f(x, y) = \frac{1}{2}(x^2 - xy + y^2)\) (the function on the left in the figure); once the cross term is large enough, e.g., \(f(x, y) = \frac{1}{2}(x^2 - 3xy + y^2)\), the form becomes indefinite and hyperbolic, as on the right (note that \(\frac{1}{2}(x^2 - 2xy + y^2) = \frac{1}{2}(x - y)^2\) sits exactly on the boundary: convex, but degenerate). Whether or not the function contains the pernicious hyperbolic behaviour comes down to the question of whether or not \(Q \succeq 0\).

Subspace Constraints

There are many important cases, particularly in optimization, where we want to restrict the domain of the function \(f\) to some subspace (actually, an affine space, which may include a constant offset from \(0\)) \(\mathbb{V} \subseteq \R^n\), of some dimension strictly less than \(n\), i.e., \(f: \mathbb{V} \rightarrow \R\). In practice, this typically arises through linear constraints \(Ax = b\) on the variable \(x\), where \(A \in \R^{m \times n}\). The representation \(\mathbb{V} = \{x\ |\ Ax = b\}\), however, is not analytically insightful.

To obtain a more friendly representation of the space \(\mathbb{V}\), take a (compact) Singular Value Decomposition of the matrix: \(A = U\Sigma V^{\mathsf{T}}\) wherein \(U \in \R^{m \times r}\), \(\Sigma \in \R^{r \times r}\) and \(V \in \R^{n \times r}\); I use \(r \le \text{min}(m, n)\) to indicate the rank of the matrix \(A\). It is essential to assume that \(b \in \mathcal{R}(A)\), i.e., that \(b\) is in the range of \(A\) – if this is not the case then the subspace \(\mathbb{V}\) is empty. Now, recall that the columns of \(U\) constitute an orthonormal basis of \(\mathcal{R}(A)\), which has dimension \(r\), and that the columns of \(V\) constitute an orthonormal basis for \(\mathcal{N}(A)^\perp\), the orthogonal complement of the nullspace of \(A\), also of dimension \(r\) (the nullspace itself has dimension \(n - r\)). Operationally, this means that \(U^{\mathsf{T}} U = I_r\) and \(V^{\mathsf{T}} V = I_r\) (but definitely not, at least in general, that \(UU^{\mathsf{T}} = I_m\) or \(VV^{\mathsf{T}} = I_n\)). Thus, we can simplify the representation of the subspace \(\mathbb{V}\) as in:

\begin{equation} \begin{aligned} \mathbb{V} &= \{x\ |\ Ax = b\}\\ &= \{x\ |\ U\Sigma V^{\mathsf{T}} x = b\}\\ &\overset{(a)}{=} \{x\ |\ V^{\mathsf{T}} x = \Sigma^{-1} U^{\mathsf{T}} b\}\\ &\overset{(b)}{=} \{V \Sigma^{-1} U^{\mathsf{T}} b + \overline{V}u\ |\ u \in \R^{n - r}\}\\ &\overset{( c)}{=} \{A^{\dagger} b + \overline{V}u\ |\ u \in \R^{n - r}\}, \end{aligned}\notag \end{equation}

where in \((a)\) it is always the case that \(\Sigma\) is invertible (and also diagonal) for a compact SVD and in \((b)\) the matrix \(\overline{V} \in \R^{n \times (n - r)}\) is a matrix whose columns form a basis for the nullspace of \(A\) (which can be obtained, for example, by running a full SVD on \(A\)), and in \(( c)\) we use the definition of the Moore-Penrose inverse of \(A\), denoted by \(A^{\dagger} = V\Sigma^{-1} U^{\mathsf{T}}\). The vector \(\bar{x} := V \Sigma^{-1} U^{\mathsf{T}} b\) is usually taken to be the nominal solution of \(Ax = b\) (it is easy to verify that it is a solution) and the remainder of the degrees of freedom are provided by the span of the columns of \(\overline{V}\). This makes more clear that \(\mathbb{V}\) is an affine space: the subspace \(\{\bar{V}u\ |\ u \in \R^{n - r}\}\) is offset from \(0\) by \(A^\dagger b\).

From here, a quadratic form \(f\) subject to the constraint \(x \in \mathbb{V}\) can be represented simply by modifying the parameters \(Q, q, \kappa\) and using the variable \(u\) from which the original \(x\) variable can be recovered by \(x(u) = A^{\dagger} b + \overline{V}u\):

\begin{equation} \begin{aligned} f(x) &= \frac{1}{2}x^{\mathsf{T}} Q x + q^{\mathsf{T}} x + \frac{1}{2}\kappa; \quad Ax = b\\ \rightarrow f(x(u)) &= \frac{1}{2}(A^{\dagger} b + \overline{V}u)^{\mathsf{T}} Q (A^{\dagger} b + \overline{V}u) + q^{\mathsf{T}} (A^{\dagger} b + \overline{V}u) + \frac{1}{2}\kappa\\ \rightarrow \tilde{f}(u) &= \frac{1}{2}u^{\mathsf{T}} \overline{V}^{\mathsf{T}} Q \overline{V} u + \bigl(\overline{V}^{\mathsf{T}} q + \overline{V}^{\mathsf{T}} QA^\dagger b \bigr)^{\mathsf{T}} u + q^{\mathsf{T}} A^\dagger b + \frac{1}{2}\kappa + \frac{1}{2} (A^\dagger b)^{\mathsf{T}} Q (A^{\dagger} b)\\ &= \frac{1}{2}u^{\mathsf{T}} \tilde{Q} u + \tilde{q}^{\mathsf{T}} u + \frac{1}{2}\tilde{\kappa}\\ \end{aligned}\notag \end{equation}

where \(\tilde{Q} = \overline{V}^{\mathsf{T}} Q \overline{V}\), \(\tilde{q} = \overline{V}^{\mathsf{T}} (q + QA^\dagger b)\) and \(\tilde{\kappa} = \kappa + 2q^{\mathsf{T}} A^\dagger b + (A^\dagger b)^{\mathsf{T}} Q (A^{\dagger} b)\). Thus, the quadratic form restricted to a subspace \(\mathbb{V}\) is yet again just another quadratic form. Moreover, there are computationally tractable means of explicitly representing the subspace \(\mathbb{V}\) when it is described through the linear system \(Ax = b\).
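
Here is a minimal numerical sketch of this reduction (the construction and names are mine): build \(\overline{V}\) from a full SVD, form \(\tilde{Q}, \tilde{q}, \tilde{\kappa}\), and check that \(f(x(u)) = \tilde{f}(u)\) for a random \(u\).

import numpy as np

rng = np.random.default_rng(4)
n, m = 6, 3                                  # x in R^n, A in R^{m x n}
A = rng.normal(size=(m, n))
b = A @ rng.normal(size=n)                   # guarantees b is in the range of A
r = np.linalg.matrix_rank(A)

Q = rng.normal(size=(n, n))
Q = 0.5 * (Q + Q.T)
q, kappa = rng.normal(size=n), rng.normal()
f = lambda x: 0.5 * x @ Q @ x + q @ x + 0.5 * kappa

# Full SVD: the last n - r right singular vectors span the nullspace of A.
U, s, Vt = np.linalg.svd(A)
Vbar = Vt[r:, :].T                           # n x (n - r)
x_nom = np.linalg.pinv(A) @ b                # the nominal solution A^dagger b

Qt = Vbar.T @ Q @ Vbar
qt = Vbar.T @ (q + Q @ x_nom)
kt = kappa + 2 * q @ x_nom + x_nom @ Q @ x_nom

u = rng.normal(size=n - r)
x_u = x_nom + Vbar @ u
assert np.allclose(A @ x_u, b)               # x(u) is feasible
assert np.isclose(f(x_u), 0.5 * u @ Qt @ u + qt @ u + 0.5 * kt)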

Convex and Concave Subspaces

To get some further geometric sense of the function \(f\), one will usually appeal to the eigendecomposition of the matrix \(Q\). Since it is symmetric, it is guaranteed to be orthogonally diagonalizable with real eigenvalues. This is the Spectral Theorem. Specifically, we have \[Q = V\Lambda V^{\mathsf{T}} = \sum_{i = 1}^n \lambda_i v_i v_i^{\mathsf{T}} \] where \(V \in \mathbb{R}^{n \times n}\) is an orthogonal matrix of eigenvectors, and \(\Lambda = \mathsf{Dg}(\lambda_1, \ldots, \lambda_n)\) is a diagonal matrix containing the associated eigenvalues. The space can be split into three subspaces, call them

\begin{equation} \begin{aligned} S_+ &= \mathsf{span}\{v_i\ |\ \lambda_i > 0\}\\ S_- &= \mathsf{span}\{v_i\ |\ \lambda_i < 0\}\\ S_0 &= \mathsf{span}\{v_i\ |\ \lambda_i = 0\}. \end{aligned} \notag \end{equation}

The subspace \(S_+\) might be called the “convex subspace” since the function \(x \mapsto \frac{1}{2} x^{\mathsf{T}} Q x\) is convex when restricted to \(x \in S_+\), and \(S_-\) might be called the “concave subspace” since \(x \mapsto \frac{1}{2} x^{\mathsf{T}} Q x\) is concave when restricted to \(x \in S_-\). The subspace \(S_0\) is the linear subspace since \(\forall x \in S_0:\ \frac{1}{2} x^{\mathsf{T}} Q x = 0\) – when a quadratic form is restricted to \(S_0\) it degenerates into a linear function. The subspace \(S_0 = \{0\}\) if \(Q\) is non-singular, and \(S_- = \{0\}\) if \(Q \succeq 0\).

Remark: There are practical reasons why these subspace distinctions are interesting. In some natural problems, we may encounter a matrix $Q$ which is indefinite, but where certain problem constraints guarantee that we are fully confined to the convex subspace $S_+$. Therefore, given the constraints of the problem, the quadratic form associated with $Q$ is “effectively” convex.
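
A short sketch of this decomposition (constructed by hand for illustration): build a \(Q\) with known inertia, recover \(S_+, S_-, S_0\) from its eigendecomposition, and check the sign of the form on each subspace.

import numpy as np

rng = np.random.default_rng(5)
# A 4x4 Q with known inertia: two positive, one negative, and one zero eigenvalue.
B = np.linalg.qr(rng.normal(size=(4, 4)))[0]           # a random orthogonal basis
Q = B @ np.diag([3.0, 1.0, -2.0, 0.0]) @ B.T

lam, vecs = np.linalg.eigh(Q)                          # spectral theorem: Q = V diag(lam) V'
S_plus = vecs[:, lam > 1e-10]
S_minus = vecs[:, lam < -1e-10]
S_zero = vecs[:, np.abs(lam) <= 1e-10]

for _ in range(100):
    xp = S_plus @ rng.normal(size=S_plus.shape[1])     # a random point in S_+
    xm = S_minus @ rng.normal(size=S_minus.shape[1])   # ... in S_-
    x0 = S_zero @ rng.normal(size=S_zero.shape[1])     # ... in S_0
    assert xp @ Q @ xp >= 0                            # convex on S_+
    assert xm @ Q @ xm <= 0                            # concave on S_-
    assert abs(x0 @ Q @ x0) < 1e-8                     # degenerate (here: zero) on S_0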

Using ideas from the previous section, we can animate 2-d slices of a quadratic form in \(\R^4\), where the 2-d slices are constructed to smoothly vary across convex and concave subspaces. The transition between the two, where the slice passes through both \(S_+\) and \(S_-\), could be termed a hyperbolic subspace of the quadratic form.

Figure 2: Slices of a Quadratic Form Across Convex, Concave, and Hyperbolic Subspaces

I created this animation by constructing the matrix \(Q = 2q_1 q_1^{\mathsf{T}} + q_2 q_2^{\mathsf{T}} - p_1 p_1^{\mathsf{T}} - 2 p_2 p_2^{\mathsf{T}}\) where \(q_1, p_1, q_2, p_2\) form an orthogonal basis of \(\mathbb{R}^4\), and then using a matrix \[H(t) = [\frac{3}{2}(\frac{2}{3} - t)_+ q_1 + 3(t - \frac{2}{3})_+ p_1, \frac{3}{2}(t - \frac{1}{3})_+ p_2 + 3(\frac{1}{3} - t)_+ q_2]\] to project \(x = H(t) u\) onto convex, hyperbolic, and concave subspaces depending on the value of \(t\). The plot is over \(u \in \mathbb{R}^2\) and animated over \(t \in [0, 1]\).

Linear Algebra

The linear algebra related to quadratic forms largely comes down to the linear algebra associated with the symmetric matrix \(Q\). The main aspects are the Cholesky factorization and Schur complements. The Cholesky decomposition is one of the most important (and fast!) matrix factorizations available, and the Schur complement arises in many different ways. Moreover, these two aspects are intimately linked.

Schur Complements

Quadratic forms often arise as bivariate quadratic forms (or more generally, multivariate quadratic forms with natural distinctions between different blocks of variables). For now, consider the function of \(x \in \mathbb{R}^{n_x}\) and \(y \in \mathbb{R}^{n_y}\) given by

\begin{equation} \begin{aligned} f(x, y) &= \frac{1}{2}\begin{bmatrix}x\\ y\\ 1\end{bmatrix}^{\mathsf{T}}\begin{bmatrix}P & C & p\\ C^{\mathsf{T}} & Q & q\\ p^{\mathsf{T}} & q^{\mathsf{T}} & \kappa \end{bmatrix}\begin{bmatrix}x\\ y\\ 1\end{bmatrix}\\ &= \frac{1}{2}x^{\mathsf{T}}Px + \frac{1}{2}y^{\mathsf{T}}Qy + x^{\mathsf{T}} C y + p^{\mathsf{T}} x + q^{\mathsf{T}} y + \frac{1}{2}\kappa \end{aligned} \notag \end{equation}

where I’ll assume \(P, Q\) are both symmetric, that \(P\) is non-singular, and that \(C\) is such that the block matrix \(\begin{bmatrix}P & C\\ C^{\mathsf{T}} & Q\end{bmatrix}\) is positive definite. These assumptions mean that the function \(f(x, y)\) is jointly convex in the variables \((x, y)\).

There are two other interesting functions associated with \(f\), namely, \[y \mapsto \underset{x}{\text{inf}}\ f(x, y),\] and \[x \mapsto \underset{y}{\text{inf}}\ f(x, y).\]

The computation of these functions can be done through a mechanical procedure called Schur-complementation. The idea of the Schur complement can serve as a natural “chunk” in one’s cognition and enables easy manipulation of complicated expressions.

Partial Minimization

The above functions are those formed from partial minimization of the function \(f\), and for concreteness I’ll focus on the function $g(y) := \underset{x}{\text{inf}}\ f(x, y).$ We know from the previous section that we can guarantee that a minimizer exists if \(P \succ 0\), so we will also make this assumption here to avoid degeneracy.

To identify the minimizer of \(f\) with respect to \(x\), we need only make the identifications \(Q \leftarrow P\), \(q \leftarrow Cy + p\) and \(\kappa \leftarrow y^{\mathsf{T}} Q y + 2q^{\mathsf{T}} y + \kappa\) in the original general definition of a quadratic form to recognize that \(x^\star(y) = -P^{-1} (Cy + p)\). An explicit formula for the function \(g\) is then obtained through the calculation

\begin{equation} \begin{aligned} g(y) &= \frac{1}{2}y^{\mathsf{T}} Q y + q^{\mathsf{T}} y + \frac{1}{2}\kappa - \frac{1}{2}(C y + p)^{\mathsf{T}} P^{-1} (C y + p) \\ &= \frac{1}{2}y^{\mathsf{T}} (Q - C^{\mathsf{T}} P^{-1} C) y + (q - C^{\mathsf{T}}P^{-1} p)^{\mathsf{T}} y + \frac{1}{2}(\kappa - p^{\mathsf{T}} P^{-1} p)\\ &= \frac{1}{2}\begin{bmatrix}y\\ 1 \end{bmatrix}^{\mathsf{T}} \begin{bmatrix}Q - C^{\mathsf{T}} P^{-1} C & (q - C^{\mathsf{T}}P^{-1} p)\\ (q - C^{\mathsf{T}}P^{-1} p)^{\mathsf{T}} & \kappa - p^{\mathsf{T}} P^{-1} p \end{bmatrix} \begin{bmatrix}y\\ 1\end{bmatrix}. \end{aligned} \notag \end{equation}

The beauty of this partial minimization is that \(g\) is still a quadratic form. If it is still convex (we’ll check this next!) then we could say that the set of positive definite quadratic forms is closed under the operation of partial minimization.

The formulas above seem rather complicated at first. However, they follow a completely mechanical pattern and you can write down analytic expressions for cartoonishly complicated block quadratic forms with simple notation, and understand at a glance exactly what is going on (even if you’re not reading every single equation in detail).
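
As a concrete sketch of partial minimization (a toy construction of my own), we can check numerically that substituting the minimizer \(x^\star(y)\) into \(f\) agrees with the Schur-complement formula for \(g(y)\):

import numpy as np

rng = np.random.default_rng(6)
nx, ny = 3, 2
M = rng.normal(size=(nx + ny, nx + ny))
M = M @ M.T + np.eye(nx + ny)                # a jointly positive definite block matrix
P, C, Q = M[:nx, :nx], M[:nx, nx:], M[nx:, nx:]
p, q, kappa = rng.normal(size=nx), rng.normal(size=ny), rng.normal()

f = lambda x, y: (0.5 * x @ P @ x + 0.5 * y @ Q @ y + x @ C @ y
                  + p @ x + q @ y + 0.5 * kappa)

def g(y):
    # Partial minimization over x, using the formulas derived above.
    Qs = Q - C.T @ np.linalg.solve(P, C)
    qs = q - C.T @ np.linalg.solve(P, p)
    ks = kappa - p @ np.linalg.solve(P, p)
    return 0.5 * y @ Qs @ y + qs @ y + 0.5 * ks

y = rng.normal(size=ny)
x_star = -np.linalg.solve(P, C @ y + p)      # the minimizer of f(., y)
assert np.isclose(f(x_star, y), g(y))
assert f(x_star + rng.normal(size=nx), y) >= g(y)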

Definiteness Conditions

What we have discovered through this exploration is called a Schur Complement. Suppose for simplicity that \(p = q = 0\) and \(\kappa = 0\) and for sake of notation that \[\mathcal{Q} := \begin{bmatrix}P & C\\ C^{\mathsf{T}} & Q\end{bmatrix}.\] Then, \(g(y) = \frac{1}{2}y^{\mathsf{T}}(Q - C^{\mathsf{T}} P^{-1} C)y\) and the matrix \(Q - C^{\mathsf{T}} P^{-1} C\) is called the Schur Complement of \(P\) in \(\mathcal{Q}\); it is usually denoted \(\mathcal{Q} / P\). There is also a Schur complement of \(Q\) in \(\mathcal{Q}\) given by \(\mathcal{Q} / Q := P - CQ^{-1}C^{\mathsf{T}}\). The way to remember which is which is to think of the “\(/P\)” as a form of division, analogous to \(P^{-1}\) (now you remember which matrix is being inverted), and then to match up the shapes of \(C\) and \(C^{\mathsf{T}}\) with \(P\) or \(Q\) to figure out whether to write \(C^{\mathsf{T}} P^{-1} C\) or \(CP^{-1} C^{\mathsf{T}}\).

Forming the matrix \(\mathcal{Q} / P\) requires at least that \(P\) be non-singular. If, in addition, it is positive definite, \(P \succ 0\), then \(g(y) = \frac{1}{2}y^{\mathsf{T}} (\mathcal{Q} / P) y\) is the partial minimum of the whole quadratic form with respect to \(x\). In order for \(g(y)\) itself to have a minimizer (equivalently here, for the infimum not to be \(-\infty\)), the Schur complement must be positive semi-definite; for the strict statements below we require \(\mathcal{Q} / P \succ 0\). This, combined with the fact that \[\underset{x}{\text{inf}}\ \underset{y}{\text{inf}}\ f(x, y) = \underset{y}{\text{inf}}\ \underset{x}{\text{inf}}\ f(x, y) = \underset{x, y}{\text{inf}}\ f(x, y),\] implies the following methods of checking definiteness:

  • If \(P \succ 0\) and \(\mathcal{Q} / P \succ 0\), then \(\mathcal{Q} \succ 0\).
  • If \(\mathcal{Q} \succ 0\) and \(P\) is non-singular, then \(P \succ 0\) and \(\mathcal{Q} / P \succ 0\).

These relations have important theoretical use.
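
They are also easy to check numerically; here is a quick sketch of my own, with eigenvalue tests standing in for the definiteness checks:

import numpy as np

rng = np.random.default_rng(7)
nx, ny = 3, 2

for _ in range(200):
    M = rng.normal(size=(nx + ny, nx + ny))
    calQ = 0.5 * (M + M.T) + rng.uniform(-1, 3) * np.eye(nx + ny)   # sometimes PD, sometimes not
    P, C, Q = calQ[:nx, :nx], calQ[:nx, nx:], calQ[nx:, nx:]

    eig_full, eig_P = np.linalg.eigvalsh(calQ), np.linalg.eigvalsh(P)
    if min(np.abs(eig_full).min(), np.abs(eig_P).min()) < 1e-6:
        continue                                   # skip numerically borderline draws

    schur = Q - C.T @ np.linalg.solve(P, C)        # calQ / P
    lhs = eig_full.min() > 0                       # is calQ positive definite?
    rhs = eig_P.min() > 0 and np.linalg.eigvalsh(schur).min() > 0
    assert lhs == rhs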

Square Roots and Cholesky Factorization

There is a natural way of discovering the Cholesky Decomposition by continuing the minimization of the block quadratic form \(f(x, y)\). Indeed, computing the minimizer \((x, y)\) of this function involves solving the linear system

\begin{equation} \begin{aligned} \begin{bmatrix}P & C\\ C^{\mathsf{T}} & Q\end{bmatrix} \begin{bmatrix}x\\ y \end{bmatrix} = -\begin{bmatrix}p\\ q \end{bmatrix}, \end{aligned} \notag \end{equation}

involving the matrix \(\mathcal{Q}\) discussed above. Following the idea of (block) Gauss-Jordan elimination on this matrix, the Schur complement makes another appearance:

\begin{equation} \begin{aligned} \begin{bmatrix}P & C\\ C^{\mathsf{T}} & Q\end{bmatrix} &= \begin{bmatrix}I & 0\\ C^{\mathsf{T}}P^{-1} & I\end{bmatrix}\begin{bmatrix}P & C\\ 0 & Q - C^{\mathsf{T}} P^{-1} C\end{bmatrix}, \end{aligned} \notag \end{equation}

which is a block LU decomposition of \(\mathcal{Q}\), having a (dense) Schur-complement in the lower-right entry of the block upper triangular matrix. Now, rather than a block triangular decomposition, what we would really like to have is a bona-fide triangular decomposition \(\mathcal{Q} = LU\) where \(L\) is lower-triangular and \(U\) is upper triangular. Moreover, in analogy with non-negative numbers, which have square-roots, we would like to ask for even more of a positive definite matrix \(\mathcal{Q} \succ 0\) (the positive semi-definite case is possible, but more involved) – we would like an LU-decomposition \(\mathcal{Q} = LL^{\mathsf{T}}\), i.e., where \(U = L^{\mathsf{T}}\), which is a matrix square root.

Operating under the standing assumption that \(\mathcal{Q} \succ 0\), and that \(P\) is non-singular (hence \(P \succ 0\) by the above), we will obtain this factorization by induction on the dimension. Make the induction hypothesis that any positive definite matrix of smaller dimension admits a lower-triangular square root and let \(P = L_{11}L_{11}^{\mathsf{T}}\). As well, using the induction hypothesis again, along with the definiteness theorem described in the last section, we know that the Schur complement \(\mathcal{Q} / P = Q - C^{\mathsf{T}} P^{-1} C \succ 0\) is also positive definite and thus admits another decomposition \(\mathcal{Q} / P = L_{22}L_{22}^{\mathsf{T}}\). We can now write

\begin{equation} \begin{aligned} \mathcal{Q} &= \begin{bmatrix}I & 0\\ C^{\mathsf{T}}P^{-1} & I\end{bmatrix}\begin{bmatrix}P & C\\ 0 & Q - C^{\mathsf{T}} P^{-1} C\end{bmatrix}\\ &= \begin{bmatrix}I & 0\\ C^{\mathsf{T}} (L_{11} L_{11}^{\mathsf{T}})^{-1} & I\end{bmatrix}\begin{bmatrix}L_{11} L_{11}^{\mathsf{T}} & C\\ 0 & L_{22}L_{22}^{\mathsf{T}} \end{bmatrix}\\ &= \begin{bmatrix}L_{11} & 0\\ C^{\mathsf{T}} L_{11}^{-\mathsf{T}} & L_{22} \end{bmatrix}\begin{bmatrix}L_{11} & 0\\ C^{\mathsf{T}} L_{11}^{-\mathsf{T}} & L_{22} \end{bmatrix}^{\mathsf{T}}, \end{aligned} \notag \end{equation}

which is a Cholesky decomposition obtained recursively from the decomposition of \(P\) and the Schur complement \(\mathcal{Q} / P\). Beginning the recursion with \(P = \mathcal{Q}_{11}\) (the upper left entry) and \(L_{11} = \sqrt{P}\) leads to a practical algorithm for computing the decomposition. The definiteness condition of \(\mathcal{Q} \succ 0\) is essential to continuing this process since otherwise we would not be able to recurse on the Schur complement \(\mathcal{Q} / P\). What would happen in practice is that we would try to divide by zero or to take the square root of a negative number.

import numpy as np
from math import sqrt

FLOAT = np.float64

class LinAlgError(Exception):
    pass

def cholesky(A):
    """An extremely terrible Cholesky implementation."""
    A = np.array(A, dtype=FLOAT)
    n = A.shape[0]  # A = [[P; C]; [C'; Q]]
    if len(A.shape) != 2 or A.shape[1] != n:
        raise ValueError(f"A must be a square matrix but got {A.shape=}")
    if not np.all(A == A.T):
        raise ValueError("A must be a symmetric matrix!")

    L = np.zeros_like(A)
    Q = np.copy(A)
    for i in range(n):
        P = float(Q[0, 0])
        C = Q[0, 1:]  # the current off-diagonal row C', as a 1-d vector
        Q = Q[1:, 1:]

        try:
            L[i, i] = sqrt(P)  # This will detect negative eigenvalues
        except ValueError:
            raise LinAlgError("Matrix A is not positive definite")

        L[i + 1 :, i] = C / L[i, i]
        Q = Q - np.outer(C, C) / P  # recurse on the Schur complement; a zero pivot (the *semi*-definite case) breaks down here
    return L

for n in np.random.randint(3, 30, size=1000):
    A = np.random.normal(size=(n, n))
    A = A @ A.T
    L = cholesky(A)
    assert np.allclose(L @ L.T, A)

return "Passed :)"  # org-mode recognizes this odd 'return' as the output
Passed :)

Remark (Testing Definiteness): The Cholesky decomposition is not only guaranteed to succeed if $\mathcal{Q} \succ 0$, it is also guaranteed to fail if $\mathcal{Q} \succ 0$ does not hold. Thus, computationally, the Cholesky decomposition serves as a reliable means of testing the definiteness of a matrix: a matrix $A$ is positive definite if and only if a Cholesky decomposition algorithm successfully terminates in a factorization $A = LL^{\mathsf{T}}$.

Remark (Computing Schur Complements): Even though we’ve derived the Cholesky decomposition from the Schur complement, we can also go the other way. That is, the Cholesky decomposition algorithm can be used to compute Schur complements. Moreover, this results directly in the Cholesky decomposition of the Schur complement itself, and therefore in a representation of $\mathcal{Q} / P$ which provides a numerical guarantee on the positive definiteness of the Schur complement. To see this, consider the calculations below:

\begin{equation} \begin{aligned} \mathcal{Q} &= \begin{bmatrix}P & C\\ C^{\mathsf{T}} & Q\end{bmatrix}\\ &= \begin{bmatrix}L_{11} & 0\\ L_{12} & L_{22} \end{bmatrix}\begin{bmatrix}L_{11}^{\mathsf{T}} & L_{12}^{\mathsf{T}} \\ 0 & L_{22}^{\mathsf{T}} \end{bmatrix}\\ &= \begin{bmatrix}L_{11}L_{11}^{\mathsf{T}} & L_{11}L_{12}^{\mathsf{T}} \\ L_{12}L_{11}^{\mathsf{T}} & L_{12}L_{12}^{\mathsf{T}} + L_{22}L_{22}^{\mathsf{T}} \end{bmatrix} \end{aligned} \notag \end{equation}

whereby we can calculate the Schur complement $$\mathcal{Q} / P = L_{12}L_{12}^{\mathsf{T}} + L_{22}L_{22}^{\mathsf{T}} - L_{12}L_{11}^{\mathsf{T}} (L_{11}L_{11}^{\mathsf{T}})^{-1} L_{11}L_{12}^{\mathsf{T}} = L_{22}L_{22}^{\mathsf{T}}.$$ Thus, the Cholesky decomposition of $\mathcal{Q} / P$ is nothing but the bottom-right block of the Cholesky decomposition of the block matrix $\mathcal{Q}$. The advantage of this method is in numerical stability: the subtraction $Q - C^{\mathsf{T}} P^{-1} C$ not only involves a pesky inversion of the matrix $P$, but it also involves the difference of two positive semi-definite matrices, which, due to numerical roundoff, can result in a matrix which is in actuality indefinite. Using this method, if the Cholesky algorithm succeeds, then we are in the clear.
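
A short numerical sketch of this trick (using numpy's built-in Cholesky; the block sizes are arbitrary choices of mine):

import numpy as np

rng = np.random.default_rng(8)
nx, ny = 4, 3
M = rng.normal(size=(nx + ny, nx + ny))
calQ = M @ M.T + np.eye(nx + ny)                   # a positive definite block matrix
P, C, Q = calQ[:nx, :nx], calQ[:nx, nx:], calQ[nx:, nx:]

L = np.linalg.cholesky(calQ)                       # lower triangular, calQ = L L'
L22 = L[nx:, nx:]                                  # bottom-right block of the factor

schur_direct = Q - C.T @ np.linalg.solve(P, C)     # Q - C' P^{-1} C
assert np.allclose(schur_direct, L22 @ L22.T)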

Probability and Statistics

Quadratic forms appear in many places in statistics, largely as a result of computing second-order statistics (the mean and the variance). Remarkably, though I derived them from examples in optimization, both the Schur complement and the Cholesky factorization play important roles in this theory as well.

Computations with the Mean and the Variance

Two of the most important statistics of an $\mathbb{R}$-valued random variable, call it \(X\), are the mean and the variance \(\mu = \mathbb{E}X\) and \(\sigma^2 = \mathbb{E}(X - \mu)^2\). Taken together, \((\mu, \sigma^2)\) are the second-order statistics of \(X\). The mean value measures the central tendency of \(X\), serving as one estimate of its “typical” value, and the variance \(\sigma^2\) is the average squared distance of \(X\) from its mean value. When \(\sigma^2\) is small, the probability that \(X\) is close to \(\mu\) is high (Chebyshev’s inequality makes this precise); when \(\sigma^2\) is large, \(X\) can stray far from its mean.

For vectors of random variables \(\mathbf{X} = (X_1, \ldots, X_n) \in \mathbb{R}^n\) the mean is simply defined element-wise \[\mathbb{E}\mathbf{X} = \begin{bmatrix} \mathbb{E}X_1\\ \vdots\\ \mathbb{E}X_n\\ \end{bmatrix} = \begin{bmatrix}\mu_1\\ \vdots \\ \mu_n \end{bmatrix} = \bm{\mu}.\] The variance of the vector random variable becomes somewhat more complicated, as the best definition is now a matrix. This variance matrix, usually denoted \(\Sigma \in \mathbb{R}^{n \times n}\), is defined through an outer-product of the centered random variables \[\Sigma = \mathbb{E}(\mathbf{X} - \bm{\mu})(\mathbf{X} - \bm{\mu})^{\mathsf{T}},\] and includes all of the individual (i.e., marginal) variances \(\Sigma_{ii} = \mathbb{E}(X_i - \mu_i)^2\), but also all of the co-variances \(\Sigma_{ij} = \mathbb{E}(X_i - \mu_i)(X_j - \mu_j)\). These covariance terms provide a measure of how the random variables in the vector tend to move together – i.e., they provide a quantitative measure of dependence. The reason for this interpretation may become clear when we look at estimating one random variable given another.

Sampling

It is easy to see that the covariance matrix is symmetric – indeed, \(\Sigma_{ij} = \mathbb{E}(X_i - \mu_i)(X_j - \mu_j) = \mathbb{E}(X_j - \mu_j)(X_i - \mu_i) = \Sigma_{ji}\). What is somewhat less obvious is that it is also positive semi-definite \(\Sigma \succeq 0\). To see this:

\begin{equation} \begin{aligned} a^{\mathsf{T}} \Sigma a &= a^{\mathsf{T} }\bigl[\mathbb{E}(\mathbf{X} - \bm{\mu})(\mathbf{X} - \bm{\mu})^{\mathsf{T}} \bigr] a\\ &\overset{(a)}{=} \mathbb{E} a^{\mathsf{T}} (\mathbf{X} - \bm{\mu})(\mathbf{X} - \bm{\mu})^{\mathsf{T}} a\\ &\overset{(b)}{=} \mathbb{E} ||a^{\mathsf{T}} (\mathbf{X} - \bm{\mu})||_2^2 \ge 0, \end{aligned}\notag \end{equation}

where \((a)\) uses the linearity of expectation \(a\mathbb{E}X = \mathbb{E}[aX]\) and \((b)\) uses \(a^{\mathsf{T}} a = ||a||_2^2 := \sum_{i = 1}^n a_i^2\). For simplicity, I will assume that \(\Sigma\) is positive definite \(\Sigma \succ 0\), which is generally the case – if \(\Sigma\) is not positive definite, then it can be shown that some nontrivial linear combination of the components of \(\mathbf{X}\) must be (almost surely) constant, e.g., \(X_1 = X_2 - X_3\), a case we simply exclude.

Since \(\Sigma \succ 0\), it must admit a Cholesky factorization \(\Sigma = LL^{\mathsf{T}}\). This factorization can be used computationally to sample random variables with a prescribed variance matrix. As long as it is possible to sample independent random variables with variance \(1\), we can construct a random vector with variance equal to the \(n \times n\) identity matrix \(I_n\) (how?) – call this vector \(\mathbf{Z}\). Supposing that \(\mathbf{Z}\) has zero mean \(\mathbb{E}\mathbf{Z} = 0\) (adding back a constant mean vector \(\bm{\mu}\) afterwards is easy), we can now transform \(\mathbf{Z}\) so that it has variance \(\Sigma\): let \(\mathbf{X} = L\mathbf{Z}\) and calculate:

\begin{equation} \begin{aligned} \mathbb{E}\mathbf{XX}^{\mathsf{T}} &= \mathbb{E}(L\mathbf{Z})(L\mathbf{Z})^{\mathsf{T}}\\ &= \mathbb{E}L\mathbf{Z}\mathbf{Z}^{\mathsf{T}} L^{\mathsf{T}}\\ &= L\bigl[\mathbb{E}\mathbf{Z}\mathbf{Z}^{\mathsf{T}} \bigr]L^{\mathsf{T}}\\ &= LI_n L^{\mathsf{T}}\\ &= \Sigma. \end{aligned}\notag \end{equation}

Incidentally, a slight modification to the calculation above can show us that if \(\mathbf{X}\) has variance matrix \(\Sigma\), then for any matrix \(A\), the variance of \(A\mathbf{X}\) is given by \(A\Sigma A^{\mathsf{T}}\).
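
Here is a sketch of the sampling recipe in numpy (the target \(\Sigma\) and mean below are arbitrary numbers of my choosing); the empirical moments of the samples should approach the prescribed ones:

import numpy as np

rng = np.random.default_rng(9)
Sigma = np.array([[4.0, 1.2, 0.0],
                  [1.2, 2.0, -0.5],
                  [0.0, -0.5, 1.0]])         # a prescribed (positive definite) variance matrix
mu = np.array([1.0, -2.0, 0.5])
n, N = 3, 200_000

L = np.linalg.cholesky(Sigma)                # Sigma = L L'
Z = rng.normal(size=(n, N))                  # iid standard entries: mean 0, variance I_n
X = mu[:, None] + L @ Z                      # each column now has mean mu and variance Sigma

assert np.allclose(X.mean(axis=1), mu, atol=0.05)
assert np.allclose(np.cov(X), Sigma, atol=0.1)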

Linear Minimum Mean Squared Error Estimation

To understand more about the meaning of the variance matrix, let’s see how it appears in a particularly important quadratic optimization problem: the linear minimum mean-squared error estimator (LMMSE estimator). The setup for this problem is that we have a joint random variable \((X, Y)\) where \(X\) has dimension \(m\) and \(Y\) has dimension \(n\), and where \(X, Y\) have a known joint variance matrix:

\begin{equation} \text{Var}\begin{bmatrix}X\\ Y\end{bmatrix} = \begin{bmatrix}\Sigma_X & R\\ R^{\mathsf{T}} & \Sigma_Y \end{bmatrix} := \Sigma \notag \end{equation}

\(\Sigma \in \mathbb{S}_{++}^{n + m}\), and mean values \(\mu_X \in \mathbb{R}^m\), \(\mu_Y \in \mathbb{R}^n\). We want to find a “gain matrix” \(K \in \mathbb{R}^{m \times n}\) and a “bias vector” \(b \in \mathbb{R}^{m}\) to solve the following problem:

\begin{equation} \begin{aligned} \underset{K, b}{\text{minimize}}\ \frac{1}{2}\mathbb{E}||X - KY - b||_2^2. \end{aligned}\notag \end{equation}

The interpretation of this problem is that we are making an actual observation of \(Y\), and want to use that to make some inference about what the value of \(X\) is, given the joint second order statistics (mean and variance) of \((X, Y)\). An example application is that \(X\) is some “state” we are interested in (e.g., temperature, position, velocity, etc.) and \(Y\) is the measurement of some sensor system.

To solve this problem, first start by minimizing over \(b\). If we expand the objective we have

\begin{equation} \begin{aligned} \frac{1}{2}\mathbb{E} ||X - KY - b||_2^2 &= \frac{1}{2}\mathbb{E}||X - KY||_2^2 + \frac{1}{2}||b||_2^2 - \mathbb{E}(X - KY)^{\mathsf{T}} b\\ &= \frac{1}{2}\bigl((\mu_X - K\mu_Y) - b\bigr)^{\mathsf{T}} \bigl((\mu_X - K\mu_Y) - b\bigr) + \frac{1}{2}\mathbb{E}||X - KY||_2^2 - \frac{1}{2}(\mu_X - K\mu_Y)^{\mathsf{T}}(\mu_X - K\mu_Y), \end{aligned}\notag \end{equation}

from which it is clear the optimal value is \(b = \mu_X - K\mu_Y\). It is quite convenient that the optimal value of \(b\) makes \(X - KY - b\) have zero mean. Because of this we can simply assume, without loss of generality, that \(\mathbb{E}X = 0\) and \(\mathbb{E}Y = 0\); indeed, we could make the substitutions \(X_0 = X - \mu_X\) etc., but this is unnecessarily cumbersome. So, making this zero mean assumption, we again expand out the objective function

\begin{equation} \begin{aligned} \mathbb{E}||X - KY||_2^2 &= \mathbb{E}(X - KY)^{\mathsf{T}}(X - KY)\\ &= \mathbb{E}\bigl( X^{\mathsf{T}} X - X^{\mathsf{T}}KY - Y^{\mathsf{T}}K^{\mathsf{T}}X + Y^{\mathsf{T}}K^{\mathsf{T}}KY \bigr)\\ &= \mathsf{tr}\bigl(\Sigma_X - KR^{\mathsf{T}} - RK^{\mathsf{T}} + K\Sigma_Y K^{\mathsf{T}} \bigr)\\ &= \mathsf{tr}(K - R\Sigma_Y^{-1})\Sigma_Y(K - R\Sigma_Y^{-1})^{\mathsf{T}} + \mathsf{tr}(\Sigma_X - R \Sigma_Y^{-1} R^{\mathsf{T}}), \end{aligned}\notag \end{equation}

where in the last step we are again completing the square. The second trace term does not depend upon \(K\), and the first term is minimized by the choice \(K = R\Sigma_Y^{-1}\) (why? Obtain a generalization for quadratic forms and use \(\Sigma_Y \succ 0\)). Remarkably, the value of the objective at the optimum is given exactly by

\begin{equation} \underset{K}{\text{min}}\ \mathbb{E}||X - KY||_2^2 = \mathsf{tr}(\Sigma_X - R \Sigma_Y^{-1} R^{\mathsf{T}}) = \mathsf{tr}(\Sigma / \Sigma_Y), \notag \end{equation}

which is another Schur complement! More generally, at \(K^\star = R\Sigma_Y^{-1}\) we have that \(\text{Var}(X - K^\star Y) = \Sigma / \Sigma_Y\), so that the variance matrix of the error vector \(E = X - K^\star Y\) is given exactly by a Schur complement in the original joint variance matrix. This Schur complement is measuring the amount of variance which is removed from \(\Sigma_X\) by conditioning on \(Y\). Another worthwhile observation is that \(\mathbb{E}\, E^{\mathsf{T}} K^\star Y = \mathbb{E}(X - R\Sigma_Y^{-1} Y)^{\mathsf{T}} R\Sigma_Y^{-1} Y = 0\), which is to say that the estimate is orthogonal to the error. This is not a coincidence and is a consequence of the Projection Theorem.
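
The following sketch puts the pieces together (the covariance numbers are arbitrary choices of mine): compute \(K^\star\) and \(b\), draw joint samples with the Cholesky recipe from earlier, and check that the empirical error covariance matches the Schur complement.

import numpy as np

rng = np.random.default_rng(10)

# A small joint covariance for (X, Y) with X in R^2 and Y in R^3.
Sigma_X = np.array([[2.0, 0.3], [0.3, 1.5]])
R = np.array([[0.8, 0.2, -0.4], [0.1, 0.5, 0.3]])
Sigma_Y = np.array([[1.5, 0.2, 0.0], [0.2, 1.0, -0.1], [0.0, -0.1, 2.0]])
Sigma = np.block([[Sigma_X, R], [R.T, Sigma_Y]])
mu_X, mu_Y = np.array([1.0, -1.0]), np.array([0.5, 0.0, 2.0])

K = R @ np.linalg.inv(Sigma_Y)                     # optimal gain K* = R Sigma_Y^{-1}
b = mu_X - K @ mu_Y                                # optimal bias

# Monte Carlo check: Var(X - K*Y - b) should equal the Schur complement Sigma / Sigma_Y.
N = 200_000
L = np.linalg.cholesky(Sigma)
Z = np.concatenate([mu_X, mu_Y])[:, None] + L @ rng.normal(size=(5, N))
X, Y = Z[:2], Z[2:]
E = X - (K @ Y + b[:, None])                       # estimation errors, one per column
schur = Sigma_X - R @ np.linalg.solve(Sigma_Y, R.T)
assert np.allclose(np.cov(E), schur, atol=0.05)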

We can construct an illustrative example of these calculations, along with the sampling scheme of the previous section, in two dimensions. For this illustration I’m using the variance matrix \(\Sigma = \begin{bmatrix}\sigma_X^2 & \rho\sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{bmatrix}\) and thus the optimal gain is \(K = \rho \sigma_X \sigma_Y \times \frac{1}{\sigma_Y^2} = \rho \frac{\sigma_X}{\sigma_Y}\), which leads to \(\hat{X}|Y = \rho \frac{\sigma_X}{\sigma_Y} Y\) and \(\sigma^2_{X|Y} = (1 - \rho^2)\sigma_X^2\), which are all worthwhile formulas to commit to memory. The particular values used in the below example are \(\sigma_X^2 = 4, \sigma_Y^2 = 1, \rho = 0.8\). The way that the estimator changes in response to changes in parameters can be visualized by inspecting the equations provided earlier – in particular, the gain \(K\) is linear in \(\rho\), while the conditional variance \(\sigma^2_{X|Y}\) is linear in \(\sigma_X^2\) and shrinks as \(\rho^2\) grows.

Figure 3: Optimal Linear Estimators

It may at first look a bit odd that the regression line is not aligned with the principal axis of the contour plot. This is not a mistake. The regression line which is aligned with that “natural looking” direction is obtained from Principal Component Regression, and is sub-optimal in this setting where we know the exact statistics. If the statistics \(\Sigma\) have to be estimated from available data, then it is possible that the regularization effect obtained from looking only at some top \(k\) principal components gives us overall lower squared error on a test set.

Remark (MMSE Estimator): As opposed to the LMMSE estimator, the MMSE estimator is the optimal estimator which is not constrained to be linear. The MMSE in general depends upon all of the higher order statistics, and cannot be computed without full knowledge of the joint distribution of $X, Y$.

Remark (The Gaussian Distribution): These formulas also arise in the conditioning formulas for the Gaussian distribution. This is not exactly a coincidence – Gaussians are fully characterized by their second order statistics and, as a family of distributions, are closed under linear combinations. In the Gaussian case, the MMSE coincides with the LMMSE.

Linear Regression

Let’s look at yet another example: linear regression. This is a problem which is closely related to Linear Minimum Mean Squared Error Estimation, but instead of using the observations \(Y\) of a sensor to estimate some state variable \(X\), we consider a finite dataset \(\mathcal{D}_N = \{(x_i, y_i)\}_{i = 1}^N\) of \(N\) examples. The goal of linear regression is to find some model function \(f\) such that the average squared error of the model \[y_i \approx f(x_i)\] is minimized. Formally, we would write

\begin{equation} \underset{f \in \mathcal{F}}{\text{minimize}}\ \frac{1}{2N}\sum_{i = 1}^N \big(y_i - f(x_i)\big)^2 \tag{$R_{\mathcal{F}}$} \end{equation}

where \(\mathcal{F}\) is some “admissible class” of functions.

The simplest useful class of functions \(\mathcal{F}\) is the class of linear functions \(f(x) = x^{\mathsf{T}} \beta\), parameterized by some vector \(\beta\). Specializing to this class gives the problem of (ordinary) linear regression. By expanding the objective function we can easily pluck out the optimizer:

\begin{equation} \begin{aligned} \frac{1}{2N}\sum_{i = 1}^N \big(y_i - f(x_i)\big)^2 &= \frac{1}{2N}\sum_{i = 1}^N \big(y_i - x_i^{\mathsf{T}} \beta \big)^2\\ &= \frac{1}{2N}\sum_{i = 1}^N y_i^2 - \frac{1}{N}\sum_{i = 1}^N y_i (x_i^{\mathsf{T}} \beta) + \frac{1}{2N}\sum_{i = 1}^N (x_i^{\mathsf{T}} \beta)^2\\ &= \kappa - \big(\frac{1}{N}\sum_{i = 1}^N y_i x_i \big)^{\mathsf{T}} \beta + \frac{1}{2N}\sum_{i = 1}^N \mathsf{tr}\ (x_i^{\mathsf{T}} \beta)(x_i^{\mathsf{T}} \beta) \\ &= \kappa - \big(\frac{1}{N}\sum_{i = 1}^N y_i x_i \big)^{\mathsf{T}} \beta + \frac{1}{2} \beta^{\mathsf{T}} \big(\frac{1}{N}\sum_{i = 1}^N x_i x_i^{\mathsf{T}} \big) \beta, \end{aligned}\notag \end{equation}

and hence \[\beta^\star_{lr} = \big( \frac{1}{N}\sum_{i = 1}^N x_i x_i^{\mathsf{T}} \big)^{-1} \big( \frac{1}{N}\sum_{i = 1}^N y_i x_i \big).\]

The whole problem can be made considerably simpler by means of matrix notation

\begin{equation} \mathbf{y} = \begin{bmatrix}y_1\\ \vdots\\ y_N \end{bmatrix}, \mathbf{X} = \begin{bmatrix}x_1^{\mathsf{T}} \\ \vdots \\ x_N^{\mathsf{T}} \end{bmatrix},\notag \end{equation}

so that

\begin{equation} \begin{aligned} \frac{1}{2N} \sum_{i = 1}^N \bigl( y_i - x_i^{\mathsf{T}} \beta \bigr)^2 &= \frac{1}{2N}||\mathbf{y} - \mathbf{X}\beta||_2^2\\ \implies \beta^\star_{lr} = \bigl(\mathbf{X}^{\mathsf{T}} \mathbf{X}\bigr)^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{y}, \end{aligned}\notag \end{equation}

which is a rather famous expression.
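
A quick sketch in numpy (on synthetic data of my own making) comparing the normal-equations formula with a standard least-squares solver:

import numpy as np

rng = np.random.default_rng(11)
N, d = 500, 3
X = rng.normal(size=(N, d))                        # data matrix, row i is x_i'
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)       # noisy linear responses

beta_ne = np.linalg.solve(X.T @ X, X.T @ y)        # the naive normal-equations route
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)    # QR/SVD-based solver (numerically preferable)

assert np.allclose(beta_ne, beta_ls)
assert np.allclose(beta_ne, beta_true, atol=0.05)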

Remark (Admissible Functions): Contrary to initial impressions, the regression problem $R_\mathcal{F}$ can actually be solved, in closed form, at least up to an $\epsilon$ of approximation error, for an extremely large class of functions $\mathcal{F}$. Indeed, for just one example, when each $x$ is a single real number (to lighten notation) $x_i \in \R$ (as opposed to a vector) and $\mathcal{F}$ is the class of all continuous functions on a bounded interval, then any $f \in \mathcal{F}$ can be (uniformly) approximated to arbitrary accuracy using a polynomial $f_a(x) = \sum_{k = 0}^K a_k x^k$ parameterized by the coefficients $a$ (this is the Stone-Weierstrass Theorem). Each sample $x_i$ can then be transformed into a vector $(1, x_i, x_i^2, \ldots, x_i^K)$, and we can perform linear regression with the parameters $\beta = (a_0, a_1, \ldots, a_K)$.
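
A small sketch of this polynomial-features idea (my own toy example, using numpy's Vandermonde helper):

import numpy as np

rng = np.random.default_rng(12)
N, K = 200, 5
x = rng.uniform(-1, 1, size=N)
y = np.sin(3 * x) + 0.05 * rng.normal(size=N)      # a smooth nonlinear target plus noise

Phi = np.vander(x, K + 1, increasing=True)         # rows are (1, x_i, x_i^2, ..., x_i^K)
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # ordinary linear regression in the features

x_test = np.linspace(-1, 1, 50)
y_hat = np.vander(x_test, K + 1, increasing=True) @ beta
assert np.max(np.abs(y_hat - np.sin(3 * x_test))) < 0.2   # a rough accuracy check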

Remark: The way I’ve written the optimal solution $\beta_{lr}^\star$ is naive. Computationally, the problem will usually be solved by means of a QR or Singular Value decomposition of the data matrix $X$ or by an iterative descent algorithm. Analytically, it is also possible to write the optimal solution using the pseudo-inverse as in $\beta_{lr}^\star = X^\dagger y$, which is much more notationally compact, and doesn’t assert the existence of an inverse matrix. Indeed, writing out \(\bigl(\mathbf{X}^{\mathsf{T}} \mathbf{X}\bigr)^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{y}\) is a plain waste of ink.

Conclusion

This post is a bit of a whirlwind tour of fundamental facts about quadratic forms and positive definite matrices. All of these facts are truly elementary pieces of the computational linear algebra toolkit, and serve as building blocks for more sophisticated problems. Indeed, a substantial motivation for writing all of this down is that now I can fearlessly use all these facts in future posts without feeling like I’m skipping over important details!

I hope you, dear reader (I trust there is at least one 🙏), have enjoyed this tour and can now comfortably apply this knowledge to your own problems :)