Quant Out of Water
Lyapunov Stability, Linear Systems, and Semidefinite Programming

Lyapunov Stability, Linear Systems, and Semidefinite Programming

Dynamical systems are ubiquitous models occuring in science, engineering, and mathematics. Not only are they used to model real-world dynamic phenomena like the dynamics of chemical plants, population growth, and physical engineered systems, they can also be applied to model algorithms themselves. This post focuses on linear dynamical systems, their analysis by means of semidefinite programming, and connections with control theory through the computation of quadratic functionals of their paths.

Introduction

This post introduces the Lyapunov approach to stability and convergence analysis. While this approach and its connection to semidefinite programming may seem like “overkill” for linear systems (where you can just look at the eigenvalues of the system matrix), I believe it is an insightful starting point. Analysis based on Lyapunov functions generalizes naturally to both nonlinear and stochastic systems, and the connection with semidefinite programming can be exploited to derive an analysis which may be, in some sense, optimal.

References

Much of the material in this post is derived or inspired from: Boyd, Stephen, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear matrix inequalities in system and control theory. Society for industrial and applied mathematics, 1994. This is an easily accessible book on linear matrix inequalities. I have also drawn some ideas and inspiration from the Convex Optimization book (also freely available ) Boyd, Stephen, Stephen P. Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. and have made use of cvxpy for many some of the examples: Diamond, Steven, and Stephen Boyd. “CVXPY: A Python-embedded modeling language for convex optimization.” The Journal of Machine Learning Research 17, no. 1 (2016): 2909-2913.

Linear Dynamical Systems

Linear dynamical systems are generally characterized by a matrix ARn×nA \in \R^{n \times n} which describes how the state vector xRnx \in \R^n changes through time. For continuous time dynamical systems, we work with ordinary differential equations

ddtx(t)=Ax(t),\begin{equation} \frac{\mathsf{d}}{\mathsf{d} t} x(t) = Ax(t), \end{equation}

usually written simply as x˙=Ax\dot{x} = Ax where it is to be understood that xx is a continuous function of time tt and x˙\dot{x} denotes differentiation with respect to tt. In this case, AA describes a velocity field in state space – each point xRnx \in \R^n is associated to some velocity AxAx. Analogously, for discrete time dynamical systems, we have an iterative relationship

x(t+1)=Ax(t),\begin{equation} x(t + 1) = Ax(t), \end{equation}

where in this case the matrix AA specifies the next state of the system.

Remark: I like to use tt for both discrete and continuous time. The reader is trusted to recognize which is which (or if it matters at all) from the context.

Somewhat more generally, we could work with systems having some constant offset bRnb \in \R^n as in x˙=Ax+b\dot{x} = Ax + b. However, if AA is a full-rank matrix (meaning that it is invertible), then this additional offset term is not interesting. Indeed, we can just as-well shift the coordinates of the system and instead work with z˙=Az\dot{z} = Az where z=xA1bz = x - A^{-1} b. Anything interesting that can be established for xx can be done so by first doing it for zz. If the matrix AA is not invertible, then there is a subspace which the matrix AA has no effect upon, and this subspace can be analyzed separately. It will become more-or-less obvious how to do this as we proceed, so I will make the standing assumption that AA is full-rank and b=0b = 0.

Remark: The case of linear systems with a general input x˙(t)=Ax(t)+Bu(t)\dot{x}(t) = Ax(t) + Bu(t) on the other hand are quite interesting, and is likely to be the topic of some future post.

Examples

Firstly, it should be understood that much of the analysis of nonlinear dynamical systems, i.e., ordinary differential equations of the form x˙=f(x)\dot{x} = f(x), comes down to the linear case. The reason lies in the Hartman-Grobman Theorem (which we have also already seen ): the nonlinear system x˙=f(x)\dot{x} = f(x) can be approximated by the linear dynamical system x˙=Ax\dot{x} = Ax for in a neighbourhood of an equilibrium point (i.e., point xex_e such that f(xe)=0f(x_e) = 0) with the matrix A=Df(xe)A = \mathsf{D} f(x_e), the derivative of ff. More examples follow.

Gradient Descent

One of the modern drivers of interest in systems theory is for applications in machine learning, or more specifically, optimization algorithms. One of the most famous classical algorithms for optimization is gradient descent: given a function f:RnRf: \mathbb{R}^n \rightarrow \mathbb{R} that we want to minimize, the gradient descent algorithm proceeds through the iterates x(t+1)=x(t)αf(x(t)),x(t + 1) = x(t) - \alpha \nabla f(x(t)), where f(x)=(Df(x))T\nabla f(x) = (\mathsf{D} f(x))^{\mathsf{T}} is the gradient and α>0\alpha > 0 is a step-size parameter. In general, this is a nonlinear system. However, there is a particularly important special case when f(x)\nabla f(x) is a linear function of xx: when f(x)=12xTQxf(x) = \frac{1}{2}x^{\mathsf{T}} Q x is a quadratic function. In this case, f(x)=Qx\nabla f(x) = Qx and gradient descent x(t+1)=(IαQ)x(t)x(t + 1) = (I - \alpha Q)x(t) is a linear system with A=IαQ.A = I - α Q.

Electronic Circuits

Another interesting example comes from electrical engineering. In the design of electronic circuits, there are devices called capacitors (basically two metal plates sandwiched next to each other) which can store a small electric charge and then subsequently release it. Similarly, inductors (coils of wire) store magnetic energy, and later release it.

Using these devices, along with simple resistors, we can construct an RLC circuit – a circuit consisting of resitors, inductors, and capacitors (the ‘L’ for inductors coming from the scientist Heinrich Lenz). One of the simplest such circuits is to just place these three elements in a loop with each other.

Figure 1: A simple series RLC circuit

Figure 1: A simple series RLC circuit

Using Kirchoff’s voltage law , which says that the total voltage differences around a loop must sum to zero, we can obtain the equation VR(t)+VL(t)+VC(t)V_R(t) + V_L(t) + V_C(t) where VR,VL,VCV_R, V_L, V_C are the voltages across the resistor, inductor, and capacitor at time tt, respectively.

The needed facts for this analysis are the the current-voltage relationships: VR(t)=RIR(t),V_R(t) = RI_R(t), VL(t)=LI˙L(t)V_L(t) = L\dot{I}_L(t), and VC(t)=1C0tIC(τ)dτV_C(t) = \frac{1}{C}\int_0^t I_C(\tau)\mathsf{d}\tau with IR,IL,ICI_R, I_L, I_C being the currents passing through the devices, and R,L,CR, L, C denoting the “size” of the devices in units of Ohms (resistance), Henries (inductance), and Farads (capacitance). These are nothing but functions which describe how the devices operate. The other key observation is that, since these devices are all placed in series one after the other along the same wire: IR=IL=ICI_R = I_L = I_C! Thus, we have obtained the integro-differential equation:

RI(t)+LI˙(t)+1C0tI(τ)dτ=0    RI˙(t)+LI¨(t)+1CI(t)=0,\begin{equation} \begin{aligned} RI(t) + L\dot{I}(t) + \frac{1}{C}\int_0^t I(\tau) \mathsf{d} \tau &= 0\\ \implies R\dot{I}(t) + L\ddot{I}(t) + \frac{1}{C}I(t) &= 0, \end{aligned}\notag \end{equation}

which by differentiating through the whole thing results in a second order ODE. We now intend to play a clever trick with this equation – a common source of linear systems. We think of the state variables as being x(t)=(I˙(t),I(t))x(t) = \bigl(\dot{I}(t), I(t)\bigr), and by using the relationship between these derivatives described above we obtain:

[I¨I˙]=[RL1CL10][I˙I],\begin{equation} \begin{bmatrix} \ddot{I}\\ \dot{I} \end{bmatrix}= \begin{bmatrix} -\frac{R}{L} & -\frac{1}{CL}\\ 1 & 0\\ \end{bmatrix} \begin{bmatrix} \dot{I}\\ I \end{bmatrix}, \notag \end{equation}

which is a linear dynamical system. Simulation of this system is straightforward.

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
from scipy.integrate import solve_ivp

def simulate(t, R, C, L, I0=1.0, I_dot0=0.0):
    A = np.array([[0, 1.0], [-1.0 / (C * L), -R / L]])

    def ode(I):
        return A @ I

    res = solve_ivp(
        fun=lambda t, y: ode(y),
        t_span=(t[0], t[-1]),
        y0=np.array([I0, I_dot0]),
        t_eval=t
    )
    return (res.y[0, :], res.y[1, :])
Figure 2: Simulating an RLC circuit

Figure 2: Simulating an RLC circuit

The reader unfamiliar with this trick is encouraged to study this example, as the technique is very commonly used. The behaviour of the system can now be understood by analyzing the dynamics of this ODE given some set of initial conditions I(0),I˙(0)I(0), \dot{I}(0).

Semidefinite Programming

In convex optimization, a semidefinite program is an optimization problem involving a matrix variable XX, a linear cost function J(x)=tr CTX=i,jCijXijJ(x) = \mathsf{tr}\ C^{\mathsf{T}} X = \sum_{i, j} C_{ij} X_{ij}, a linear equality constraint A(X)=0\mathcal{A}(X) = 0, and a constraint X0X \succeq 0, which means that XX must be a positive semidefinite matrix. If you are not familiar with what this is, it is a matrix which is (1) symmetric and (2) has non-negative eigenvalues. A generic SDP is as follows:

minimizeXRn×ntr CTXsubject toA(X)=0X0.\begin{equation} \begin{aligned} \underset{X \in \R^{n \times n}}{\text{minimize}}&\quad \mathsf{tr}\ C^{\mathsf{T}} X\\ \text{subject to} &\quad \mathcal{A}(X) = 0\\ &\quad X \succeq 0. \end{aligned}\notag \end{equation}

The linear function A(X)=0\mathcal{A}(X) = 0 can be any linear function of XX – examples include AX=0AX = 0, AXATX+Q=0AXA^{\mathsf{T}} - X + Q = 0, etc. Of course, much more general SDPs are possible, but this will be adequate for our purposes.

Practically speaking, solving (relatively small: n<100n < 100) SDPs is fairly straightforward. Here is a basic CVX Program to solve an SDP we will encounter shortly:

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
from math import inf
import scipy.linalg as la
import cvxpy as cvx


def solve_lyapunov_sdp(A, g, verbose=True):
    """Solves the feasibility problem P > 0, (1 - g) * P - A' P A > 0."""
    n, _ = A.shape
    P = cvx.Variable(shape=(n, n), symmetric=True)
    I = np.eye(n)

    objective = cvx.Minimize(0)
    constraints = [
        P >> I,  # P is positive semi-definite
        (1 - g) * P - A.T @ P @ A >> I
        ]
    problem = cvx.Problem(objective, constraints=constraints)
    problem.solve(solver="CVXOPT", verbose=verbose)
    return problem.value, P.value

Given a matrix AA, and a constant γ\gamma, this program will solve a semidefinite feasibility program to find a matrix PIP \succeq I such that (1γ)PATPAI(1 - \gamma) P - A^{\mathsf{T}} P A \succeq I. This is actually just a computational means of solving a special type of feasability problem called a linear matrix inequality (LMI). Usually, LMIs are strict inequalities where it is required to find P0P \succ 0 and (1γ)PATPA0(1 - \gamma) P - A^{\mathsf{T}} P A \succ 0, and any such PP will do. However, since optimization problems are not usually well defined when strict inequalities are involved (since the feasible region is then not closed, c.f., Weierstrass’ Theorem ) it is necessary to modify the LMI to involve non-strict inequalities. Doing this in a naive way, simply replacing \succ by \succeq, may result in a trivial output P=0P = 0. However, since the LMI is linear in PP, we can just scale the inequality and ask that it be I\succeq I – indeed, if any P0P \succ 0 exists which satisfies the inequality, there must also exist one satisfying PIP \succeq I.

The meaning of this LMI will be explained next.

Lyapunov Stability Theory

The most important question to answer when it comes to linear systems is stability. The term stability is used to invoke the idea of a system which, when slightly perturbed, returns back to its nominal position. However, it might be somewhat of a misnomer, as what is often really meant is convergence: given some starting point x(0)x(0), does the linear system satisfy x(t)0 as tx(t) \rightarrow 0\ \text{as}\ t \rightarrow \infty. There are many methods of determining this, the most natural being an analysis of the eigenvalues of the system matrix AA. Analyzing the eigenvalues however, is a method which is quite particular to the case of simple linear systems. Another method, which has a much greater deal of potential for generalization, is the method of analysis by means of Lyapunov Stability Theory.

Lyapunov Stability

One of the most important techniques for establishing system stability is Lyapunov’s direct method – I’ll explain the idea for discrete time systems, although much of the literature focuses on the continuous time case.

The idea is to find a so-called Lyapunov function V:RnRV: \R^n \rightarrow \R on the state space which captures some generalized notion of “kinetic energy” of the system. A Lyapunov function must satisfy a few properties. The first of which is that V(0)=0,V(0) = 0, but V(x)>0V(x) > 0 everwhere else (i.e., everywhere that x0x \ne 0). I will call this property positivity, and it is this property that makes a Lyapunov function analogous to a generalized notion of kinetic energy: it is positive except at an equilibrium point when the system stops moving. The next key property is that VV must be constructed such that its value decreases along the path traced by the system. That is, if it can be shown that V(x(t+1))<V(x(t))V(x(t + 1)) < V(x(t)) for the system x(t+1)=Ax(t)x(t + 1) = Ax(t), with x(0)=x0Rnx(0) = x_0 \in \R^n being an arbitrary initial condition, then the system can be expected to converge towards 00.

These two properties (positivity and decreasing along trajectories) are not the whole story though. The additional technical property of coerciveness is required: x    V(x)||x|| \rightarrow \infty \implies V(x) \rightarrow \infty. This appears like an esoteric technical requirement, but it is essential. If VV is not coercive, then the level sets of VV, the sets Lα={xRn  V(x)α}L_\alpha = \{x \in \R^n\ |\ V(x) \le \alpha\} may be unbounded, and the trajectory x(t)x(t) can drift off x(t)x(t) \rightarrow \infty while V(x(t))V(x(t)) is stil decreasing.

In the linear case, excellent Lyapunov function candidates (i.e., functions VV which you hope will serve as Lyapunov functions) are quadratic forms V(x)=xTPxV(x) = x^{\mathsf{T}} P x, for some matrix PRn×nP \in \R^{n \times n}. It is quite easy to determine when these functions are positive – this is the case exactly when P0P \succ 0 (the matrix PP is positive-definite), and moreover, V(x)V(x) will be coercive when P0P \succ 0 (the matrix PP is positive definite). Thus, it makes sense to refer to positive coercive functions VV as being positive definite; we can refer to a Lyapunov function for a system as being a positive definite function which is decreasing along trajectories of the system.

So, in the linear case we have the candidate function V(x)=xTPxV(x) = x^{\mathsf{T}} P x and the requirement P>0P > 0. All that remains is to figure out if V(x(t+1))<V(x(t))V(x(t + 1)) < V(x(t)) along trajectories of the system. This can be established by verifying V(Ax)<V(x)V(Ax) < V(x) (why?), that is:

xRn: V(Ax)V(x)<0    xRn: xTATPAxxTPx<0    xRn: xT(ATPAP)x<0    ATPAP0.\begin{equation} \begin{aligned} \forall x \in \R^n:\ V(Ax) - V(x) < 0 &\iff \forall x \in \R^n:\ x^{\mathsf{T}}A^{\mathsf{T}}PAx - x^{\mathsf{T}}Px < 0\\ &\iff \forall x \in \R^n:\ x^{\mathsf{T}}\bigl(A^{\mathsf{T}} P A - P\bigr)x < 0\\ &\iff A^{\mathsf{T}} P A - P \prec 0. \end{aligned}\notag \end{equation}

If we are able to simply find some matrix P0P \succ 0 such that V(x)V(x) is a Lyapunov function, then we have established the convergence of our system. The astute reader will notice that this is a linear matrix inequality and can be solved by means of the above semidefinite feasibility program (with γ=0\gamma = 0).

Remark: Lyapunov stability is a fundamental tool for establishing convergence in nonlinear systems. One of the most elegant examples is that of the gradient flow x˙=f(x)\dot{x} = -\nabla f(x) for a strongly convex function ff (making this assumption for simplicity). In this case, the function ff itself can serve as a Lyapunov function: That is, ddtf(x)=f(x)Tx˙=f(x)22<0\frac{\mathsf{d}}{\mathsf{d}t} f(x) = \nabla f(x)^{\mathsf{T}} \dot{x} = -||\nabla f(x)||_2^2 < 0 except at x=xx = x^\star, the minimizer where f(x)=0\nabla f(x^\star) = 0. Incidentally, this also tells us that the gradient flow monotonically decreases the function ff. Convergence proofs of many algorithms come down to constructing an appropriate Lyapunov function.

Example

Consider a simple example with

A=[0.7501.000.6670.111].\begin{equation} A = \begin{bmatrix} 0.750 & 1.00\\ -0.667 & 0.111 \end{bmatrix}. \notag \end{equation}

Notice that AA does not have any particularly simple properties (triangularity, diagonal dominance…) that would allow us to determine at a glance (unless we are rather speedy when it comes to calculating eigenvalues of 2×22 \times 2 matrices!) that the system x(t+1)=Ax(t)x(t + 1) = Ax(t) is stable. However, we can plug this matrix into our trusty semidefinite program and obtain a matrix

P=[8.4193.6363.63612.109],\begin{equation} P = \begin{bmatrix} 8.419 & 3.636\\ 3.636 & 12.109 \end{bmatrix}, \notag \end{equation}

which proves stability by Lyapunov’s method.

Figure 3: Demonstrating that (V(x(t))) decreases

Figure 3: Demonstrating that (V(x(t))) decreases

I’ve plotted above a figure showing the value of various functions of the state x(t)x(t) for the example. The main observation is that while they all are roughly “decreasing”, it is only the Lyapunov function which is decreasing monotonically. That is not to say that this Lyapunov function is the only function with this property – there are many Lyapunov functions. More subtly, it may also be possible to find starting points such that some other arbitrary function decreases monotonically towards zero from that particular system initialization, but a Lyapunov function will do so from any starting point.

Fast Convergence

Can we modify Lyapunov’s method to somehow establish fast convergence? The answer is yes. Consider the stronger condition:

V(Ax)V(x)<γV(x)    V(Ax)<(1γ)V(x)    V(A2x)<(1γ)V(Ax)<(1γ)2V(x)    V(ATx)<(1γ)TV(x).\begin{equation} \begin{aligned} V(Ax) - V(x) &< -\gamma V(x)\\ \implies V(Ax) &< (1 - \gamma)V(x)\\ \implies V(A^2 x) &< (1 - \gamma)V(Ax) < (1 - \gamma)^2 V(x)\\ &\vdots\\ \implies V(A^T x) &< (1 - \gamma)^T V(x). \end{aligned}\notag \end{equation}

which is establishing a rate of convergence on the Lyapunov function – after TT iterations of the system, the value must have decreased by a factor of (1γ)T(1 - \gamma)^T. It is quite convenient to recognize that the existence of a quadratic Lyapunov function VγV_\gamma which verifies this fast convergence is equivalent to finding an ordinary quadratic Lyapunov function for the modified system with Aγ=11γAA_\gamma = \frac{1}{\sqrt{1 - \gamma}} A.

How do we incorporate the search for some such γ\gamma into our SDP? All we need is that ATPA(1γ)P0A^{\mathsf{T}}PA - (1 - \gamma)P \prec 0, and indeed, we would like to find the largest satisfactory γ\gamma. Unfortunately, I do not believe it is possible to directly encode optimization over γ\gamma into a convex SDP, since the product γPγ P , where both γ\gamma and PP are variables, is non-convex. Instead, since γR\gamma \in \R is just a single parameter, we can apply bisection to find an optimal γ\gamma:

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
from math import inf

def is_feasible(A, g, verbose=True):
    """Checks if P > 0, (1 - g) * P - A' P A > 0 is feasible."""
    val, _ = solve_lyapunov_sdp(A, g, verbose=verbose)
    return val < inf

def find_optimal_convergence_rate(A, num_iter=20):
    right = 1.0
    left = 0.0
    for it in range(num_iter):
        g = left + (right - left) / 2
        if is_feasible(A, g, verbose=False):
            left = g
        else:
            right = g
    return g

By solving the feasability SDP multiple times, we obtain a value of γ\gamma such that V(ATx)<(1γ)TV(x)V(A^T x) < (1 - \gamma)^T V(x), establishing a fast convergence rate for the system. It should be noted that specialized algorithms to solve the Lyapunov inequality ATPAP0A^{\mathsf{T}} P A - P \prec 0, but doing so by semidefinite programming opens up a more general array of possibilities, which I turn to next.

Remark: It is possible to represent this optimization problem as a joint optimization problem over P,γP, \gamma, but not as a convex optimization problem. However, the resulting problem is quasiconvex.

Stability of Uncertain Linear Systems

So far, I’ve talked about linear systems with a single and perfectly well known matrix AA. However, in many applications it may be the case that the matrix AA is not known exactly, or that it is only an approximation of a real system. For this reason, we would like to establish some degree of robust stability guarantee. To this end, let us suppose that we have some uncertainty set ARn×n\mathbf{A} \subset \R^{n \times n} and that our system matrix AARn×nA \in \mathbf{A} \subset \R^{n \times n}. If this is the case, then we can expect our linear system to satisfy a linear recursive inclusion:

x(t+1)Ax(t)=Δ{Ax(t)  AA}.\begin{equation} x(t + 1) \in \mathbf{A} x(t) \overset{\Delta}{=} \{A x(t)\ |\ A \in \mathbf{A}\}. \notag \end{equation}

That is, the next point x(t+1)x(t + 1) is obtained from x(t)x(t) by simple multiplication by a matrix AA, we just don’t know which matrix AA; all we know is that AAA \in \mathbf{A}. Amazingly, establishing the stability of this system can still be done by looking for a Lyapunov function V(x)=xTPxV(x) = x^{\mathsf{T}} P x with P0P \succ 0 and such that xRn,AA: V(Ax)V(x)<0\forall x \in \R^n, A \in \mathbf{A}:\ V(A x) - V(x) < 0 and this is equivalent to the collection of LMIs: P0,AA: ATPAP0.P \succ 0, \forall A \in \mathbf{A}:\ A^{\mathsf{T}} P A - P \prec 0.

Of course, not all such LMIs can be solved – it depends on the model of A\mathbf{A}. A quite flexible model is furnished by polytopic sets: A=conv{Ak  k[K]}\mathbf{A} = \mathsf{conv}\{ A_k\ |\ k \in [K] \}, that is, the convex hull of a finite number of matrices AkA_k, i.e., for any AAA \in \mathbf{A} there exists some μ[0,1]\mu \in [0, 1] and matrices Ai,AjA_i, A_j such that A=μAi+(1μ)AjA = \mu A_i + (1 - \mu) A_j. These sets can be used to construct arbitrarily accurate approximations of any convex set A\mathbf{A}.

Then, using the fact that V(x)V(x) is convex if P0P \succ 0, we have V(μAix+(1μ)Ajx)μV(Aix)+(1μ)V(Ajx)V(\mu A_i x + (1 - \mu) A_j x) \le \mu V(A_i x) + (1 - \mu) V(A_j x) and therefore a sufficient condition for the stability of the recursive inclusion is to find some P0P \succ 0 which satisfies the system of KK LMIs: P0,k[K]: AkTPAkP0.P \succ 0, \forall k \in [K]:\ A_k^{\mathsf{T}} P A_k - P \prec 0.

Speaking intuitively, if such a PP exists for some uncertain linear system, we can make a conclusion about the robust stability of the system.

Example – Gain Margins and Momentum Gradient Descent

One of the main motivations for considering uncertain linear systems arises from control theory. A very common situation is that we have some system matrix AA, as well as a gain matrix KK which we have designed. The gain matrix KK is intended to be used to steer the system x(t+1)=Ax(t)x(t + 1) = A x(t) towards zero by means of a feedback control x(t+1)=Ax(t)Kx(t)x(t + 1) = Ax(t) - Kx(t). Importantly, the system AA may very well be unstable, in which case the matrix KK is absolutely essential.

Suppose that we have some KK in hand which makes the controlled system x(t+1)=(AK)x(t)x(t + 1) = (A - K)x(t) stable. The idea of a gain margin is to determine if for some interval [α1,α2][\alpha_1, \alpha_2] that the modified system x(t+1)=(AαK)x(t)x(t + 1) = (A - \alpha K) x(t) remains stable, for any α[α1,α2]\alpha \in [\alpha_1, \alpha_2]. If this interval is wide, then it may give us some assurance that the matrix KK is robust, in some sense. The collection of system matrices A={AαK  α1αα2}\mathbf{A} = \{A - \alpha K\ |\ \alpha_1 \le \alpha \le \alpha_2\} forms a natural polytopic set.

Gradient descent constitutes a perfectly good example of this situation. For a quadratic function f(x)=12xTQxf(x) = \frac{1}{2}x^{\mathsf{T}} Q x we have non-stable natural dynamics x(t+1)=x(t)x(t + 1) = x(t), a controller QQ, and the stepsize α\alpha fills in for the gain margin, giving us x(t+1)=(IαQ)x(t)x(t + 1) = (I - \alpha Q)x(t). To make this a bit more interesting, we can augment this system with a momentum term by introducing a velocity v(t)v(t):

v(t+1)=γv(t)+αQx(t)x(t+1)=x(t)v(t+1).\begin{equation} \begin{aligned} v(t + 1) &= \gamma v(t) + \alpha Q x(t)\\ x(t + 1) &= x(t) - v(t + 1). \end{aligned}\notag \end{equation}

These equations are known as gradient descent with momentum . The idea is that if we are heading in some “nominal” direction v(t)v(t), we might as well keep going in that direction, and just add the gradient step αQx(t)\alpha Qx(t) to our velocity. A slight modification of this momentum scheme, Nesterov’s momentum, is known to be optimal in a certain sense.

In any case, the ordinary Momentum gradient descent algorithm corresponds to the (block) system matrix

Aγ,α=[γIαQγIIαQ]\begin{equation} A_{\gamma, \alpha} = \begin{bmatrix} \gamma I & \alpha Q\\ -\gamma I & I - \alpha Q \end{bmatrix}\notag \end{equation}

for the joint system (v(t),x(t))\bigl(v(t), x(t)\bigr).

Now, if you can construct a Lyapunov function (by solving a semidefinite program) for γ,α{γ1,γ2}×{α1,α2}\gamma, \alpha \in \{\gamma_1, \gamma_2\} \times \{\alpha_1, \alpha_2\} (i.e., all four corners of a square) then you will have proof that the algorithm will converge for every pair γ,α[γ1,γ2]×[α1,α2]\gamma, \alpha \in [\gamma_1, \gamma_2] \times [\alpha_1, \alpha_2].

Remark: This example is quite simplistic, but it is illustrative of general and powerful techniques . Specifically, if you can construct and solve an appropriate SDP analytically, then it can serve as a constructive proof of convergence. As well, SDPs can potentially be used to find optimal parameters for algorithms by similar means as were used earlier to calculate convergence rates by means of bisection search.

Quadratic Integrals

As a final illustration of the power of the Lyapunov approach, let’s consider how to evaluate infinite quadratic integrals (or summations, in the discrete case). Say we have a linear system x(t+1)=Ax(t)x(t + 1) = Ax(t), and are interested in computing the value of a of the path of this system according to the quadratic functional

JQ(x0)=t=0x(t)TQx(t);x(0)=x0,\begin{equation} \begin{aligned} J_Q(x_0) = \sum_{t = 0}^\infty x(t)^{\mathsf{T}} Q x(t); x(0) = x_0, \end{aligned}\notag \end{equation}

where Q0Q \succ 0 is some arbitrary positive definite matrix. Such functions arise often in control theory.

To see the connection with Lyapunov theory, recognize that we can fully expand out the summation in terms of x0x_0 by using the fact that x(t)=Atx0x(t) = A^t x_0:

JQ(x0)=t=0(Atx0)TQ(Atx0)=x0T(t=0(At)TQAt)x0=x0TQx0+(t=1(At)TQAt)x0=x0T(Q+t=1(At)TQAt)x0=x0T(Q+AT[t=0(At)TQAt]A)x0=x0TQx0+J(Ax0).\begin{equation} \begin{aligned} J_Q(x_0) &= \sum_{t = 0}^\infty \bigl(A^t x_0 \bigr)^{\mathsf{T}} Q \bigl(A^t x_0)\\ &= x_0^{\mathsf{T}} \Bigl( \sum_{t = 0}^\infty (A^t)^{\mathsf{T}} Q A^t \Bigr)x_0\\ &= x_0^{\mathsf{T}} Q x_0 + \Bigl( \sum_{t = 1}^\infty (A^t)^{\mathsf{T}} Q A^t \Bigr)x_0\\ &= x_0^{\mathsf{T}} \Bigl(Q + \sum_{t = 1}^\infty (A^t)^{\mathsf{T}} Q A^t \Bigr)x_0\\ &= x_0^{\mathsf{T}} \Bigl(Q + A^{\mathsf{T}} \bigl[\sum_{t = 0}^\infty (A^t)^{\mathsf{T}} Q A^t \bigr]A \Bigr)x_0\\ &= x_0^{\mathsf{T}} Q x_0 + J(Ax_0). \end{aligned}\notag \end{equation}

Thus, since JJ is a quadratic function of x0x_0, there must be some PP such that J(x0)=x0TPx0J(x_0) = x_0^{\mathsf{T}} P x_0 and therefore J(x)J(x) satisfies, for any starting point xRnx \in \R^n:

J(x)=xTPx=xT[Q+ATPA]x,\begin{equation} \begin{aligned} J(x) &= x^{\mathsf{T}} P x\\ &= x^{\mathsf{T}}[Q + A^{\mathsf{T}} P A] x, \end{aligned}\notag \end{equation}

or in other words, PATPA=QP - A^{\mathsf{T}} P A = Q. What this tells us is that if we have a matrix P0P \succ 0 verifying the Lyapunov inequality PATPA0P - A^{\mathsf{T}} P A \succ 0, then not only do we have a Lyapunov function, but there is some positive definite matrix Q=PATPAQ = P - A^{\mathsf{T}} P A such that xTPxx^\mathsf{T} P x is the value of the quadratic functional JQ(x)J_Q(x). Conversely, if we can solve the Lyapunov equation PATPA=QP - A^{\mathsf{T}} P A = Q for some P0P \succ 0, then we have a Lyapunov function, as well as a function for easily evaluating the value of JQ(x)J_Q(x), namely, JQ(x)=xTPxJ_Q(x) = x^{\mathsf{T}} P x.

I intend to explore the ideas surrounding quadratic functionals in more detail in a future post. These functions can be used to tie together semidefinite programming duality, stochastic linear systems (with random disturbances), as well as the classical methods of stochastic optimal control: the Kalman filter and the linear quadratic regulator.

Remark: Lyapunov’s equation PATPA=QP - A^{\mathsf{T}} P A = Q is a linear equation in PP and can be solved efficiently. In Python, the function scipy.linalg.solve_discrete_lyapunov can be used for this purpose. Lyapunov equations arise in a number of different ways, so it is a useful pattern to have in mind.

Conclusion

My purpose in writing this has been to explore the Lyapunov approach to analyzing the stability of linear dynamical systems. This is only one of many possible approaches to this analysis, but it also generalizes (in various directions) much more easily than does a direct analysis of the eigenvalues of the system matrix. Indeed, the Lyapunov approach can also lead to global convergence theorems for nonlinear systems, whereas the Hartman-Grobman theorem combined with an eigenvalue analysis can only lead to a local convergence result. Moreover, the semidefinite programming perspective on this problem can also generalize greatly, and opens the door to constructing optimal algorithms. The tradeoff with the Lyapunov approach is that constructing an appropriate Lyapunov function can be extremely difficult, and there is no ready-made recipe for doing so – it often comes down to your creativity!