The singular value decomposition (SVD) factorizes a matrix $A$ as $A = U \Sigma V^*$, where $U$ and $V$ are unitary and $\Sigma$ is diagonal with nonnegative entries. Unlike the eigenvalue decomposition, every (real or complex) matrix is guaranteed to have an SVD. Geometrically, the SVD implies that every linear transformation is equivalent to a rotation, then a scaling, then another rotation.
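To make the factorization concrete, here is a minimal NumPy sketch (the names `U`, `s`, `Vh` follow NumPy's convention of returning the singular values as a vector and $V^*$ as `Vh`; the random test matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Thin SVD: A = U @ diag(s) @ Vh, with orthonormal columns in U and rows in Vh
U, s, Vh = np.linalg.svd(A, full_matrices=False)

# The factors reconstruct A exactly (up to rounding)
assert np.allclose(U @ np.diag(s) @ Vh, A)
# U and V have orthonormal columns, and the singular values are
# nonnegative and sorted in decreasing order
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(Vh @ Vh.T, np.eye(3))
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)
```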
In data analysis, the SVD can be computed to perform principal component analysis (PCA), where an ellipsoid is used to “fit” a data matrix. In other words, once the SVD is computed, a partial SVD, $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^*$, can be used to approximate the matrix $A$, where $u_i$ and $v_i$ are the left and right singular vectors, and $\sigma_1 \ge \sigma_2 \ge \cdots$ are the singular values in decreasing order. Equivalently, a partial SVD is obtained by replacing the smaller singular values with zero in the factorization $A = U \Sigma V^*$. Intuitively, with the length of each of the ellipsoid’s axes equal to its corresponding singular value, keeping only the components with non-trivial length makes the ellipsoid a good fit to the data. Formally, the Eckart-Young-Mirsky Theorem states that a partial SVD provides the best approximation to $A$ among all low-rank matrices: for every matrix $B$ of rank at most $k$, we have $\|A - A_k\| \le \|A - B\|$, in both the spectral norm and the Frobenius norm.
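The optimality claim is easy to probe numerically. The sketch below (assuming NumPy; the variable `A_k` and the random rank-$k$ competitors are my own illustrative setup, not from the text) builds the partial SVD and checks that no random rank-$k$ matrix beats it in either norm:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))
U, s, Vh = np.linalg.svd(A, full_matrices=False)

k = 2
# Partial SVD: keep only the top-k singular triplets
# (equivalently, zero out the smaller singular values)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
assert np.linalg.matrix_rank(A_k) == k

# Eckart-Young-Mirsky: A_k beats random rank-k competitors in both norms
for _ in range(100):
    B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))  # rank <= k
    assert np.linalg.norm(A - A_k, 2) <= np.linalg.norm(A - B, 2) + 1e-12
    assert np.linalg.norm(A - A_k, 'fro') <= np.linalg.norm(A - B, 'fro') + 1e-12
```

Random competitors are of course no substitute for the proof; they only illustrate the statement being proved below.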
The theorem has a short proof for either norm, but here we describe a slightly longer proof using spectral methods. Before proceeding, since we will work with singular values of different matrices, we will write $\sigma_i(M)$ to denote the $i$-th largest singular value of a matrix $M$. We first state the following lemma, which provides an estimate on how much a low-rank perturbation to a matrix can affect its singular values: if $B$ has rank at most $k$, then for every $i \ge 1$, $\sigma_{k+i}(A) \le \sigma_i(A - B)$.
The theorem is then an easy application of the above lemma. Let $B$ be any matrix of rank at most $k$. For the spectral norm, we have
$$\|A - A_k\| = \sigma_{k+1}(A),$$
which by the lemma (setting $i = 1$) is at most $\sigma_1(A - B) = \|A - B\|$.
For the Frobenius norm, recall that the square of the Frobenius norm of a matrix is equal to the sum of the squares of its singular values. Applying the lemma above, we have
$$\|A - A_k\|_F^2 = \sum_{i \ge 1} \sigma_{k+i}(A)^2 \le \sum_{i \ge 1} \sigma_i(A - B)^2 = \|A - B\|_F^2,$$
finishing the proof.
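Both identities used here, $\|A - A_k\| = \sigma_{k+1}(A)$ and $\|M\|_F^2 = \sum_i \sigma_i(M)^2$, can be verified directly. A small NumPy sketch (random matrix and the cutoff $k$ are illustrative; note the $0$-indexed `s[k]` is $\sigma_{k+1}$):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 5))
U, s, Vh = np.linalg.svd(A, full_matrices=False)

# ||A||_F^2 equals the sum of squared singular values
assert np.isclose(np.linalg.norm(A, 'fro')**2, np.sum(s**2))

# Truncation errors are read off the tail of the spectrum:
# spectral error sigma_{k+1}, squared Frobenius error sum_{i>k} sigma_i^2
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
assert np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(s[k:]**2))
```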
The lemma is almost a re-statement of one of Weyl’s inequalities: for any matrices $X$ and $Y$,
$$\sigma_{i+j-1}(X + Y) \le \sigma_i(X) + \sigma_j(Y).$$
To see this, let $X = A - B$, $Y = B$, and $j = k + 1$, and note that $\sigma_{k+1}(B)$ is zero because $B$ has rank at most $k$. Instead of invoking this inequality directly, we will prove the lemma from first principles.
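Weyl's inequality itself is easy to test on random matrices. A hedged sketch (NumPy, square matrices for simplicity; the $0$-indexed loop checks $\sigma_{i+j-1}(X+Y) \le \sigma_i(X) + \sigma_j(Y)$ for all valid index pairs):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
for _ in range(100):
    X = rng.standard_normal((n, n))
    Y = rng.standard_normal((n, n))
    sX = np.linalg.svd(X, compute_uv=False)
    sY = np.linalg.svd(Y, compute_uv=False)
    sXY = np.linalg.svd(X + Y, compute_uv=False)
    # 0-indexed: sXY[a + b] <= sX[a] + sY[b] encodes
    # sigma_{i+j-1}(X+Y) <= sigma_i(X) + sigma_j(Y) with a = i-1, b = j-1
    for a in range(n):
        for b in range(n - a):
            assert sXY[a + b] <= sX[a] + sY[b] + 1e-10
```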
We first prove the case $i = 1$. Let $V$ be the vector space spanned by the top $k+1$ right singular vectors $v_1, \ldots, v_{k+1}$ of $A$, and let $W$ be the null space of $B$. Since $\dim V = k+1$ and $\dim W \ge n - k$, by a dimension argument there must exist a non-zero vector $w$ in $V \cap W$. Then bounding $\|Aw\|$ from above, we have
$$\|Aw\| = \|(A - B)w\| \le \sigma_1(A - B)\,\|w\|.$$
Bounding $\|Aw\|$ from below, and using the facts that $w$ is orthogonal to $v_i$ when $i > k+1$, the Pythagorean Theorem, the decreasing magnitude of singular values, and that $w$ lies in the span of the orthonormal set $\{v_1, \ldots, v_{k+1}\}$, respectively, we obtain
$$\|Aw\|^2 = \Big\|\sum_{i=1}^{k+1} \sigma_i(A)\,\langle v_i, w\rangle\,u_i\Big\|^2 = \sum_{i=1}^{k+1} \sigma_i(A)^2\,|\langle v_i, w\rangle|^2 \ge \sigma_{k+1}(A)^2 \sum_{i=1}^{k+1} |\langle v_i, w\rangle|^2 = \sigma_{k+1}(A)^2\,\|w\|^2.$$
Combining both sides yields $\sigma_{k+1}(A) \le \sigma_1(A - B)$.
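This argument is constructive enough to demonstrate numerically: we can exhibit the witness $w$ in $V \cap W$ and check both bounds on it. In the sketch below (NumPy; the stacked-matrix null-space trick for computing the intersection is my own illustrative choice, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))  # rank k

U, s, Vh = np.linalg.svd(A)
V_top = Vh[:k + 1, :].T                 # top k+1 right singular vectors of A
null_B = np.linalg.svd(B)[2][k:, :].T   # orthonormal basis of null(B)

# dim V + dim W = (k+1) + (n-k) > n, so V_top a = null_B b has a
# nontrivial solution; read it off the null space of [V_top | -null_B].
M = np.hstack([V_top, -null_B])         # n x (n+1), so null(M) is nontrivial
coeffs = np.linalg.svd(M)[2][-1, :]
w = V_top @ coeffs[:k + 1]
w /= np.linalg.norm(w)

# w lies in null(B), and it certifies sigma_{k+1}(A) <= sigma_1(A - B)
assert np.allclose(B @ w, 0)
assert np.linalg.norm(A @ w) >= s[k] - 1e-8          # lower bound on ||Aw||
assert s[k] <= np.linalg.svd(A - B, compute_uv=False)[0] + 1e-10
```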
Now we prove the general case. Using the $i = 1$ case, we know that for any matrix $B'$ with rank at most $k + i - 1$,
$$\sigma_{k+i}(A) \le \sigma_1(A - B').$$
Writing $C = A - B$ and $B' = X + Y$, we can write $A - B' = (C - X) + (B - Y)$. Since $\sigma_1(M + N) \le \sigma_1(M) + \sigma_1(N)$ for any matrices $M$ and $N$ (expressing $\sigma_1(M + N)$ as the inner product $u^*(M + N)v$ and applying Cauchy-Schwarz, where $u$ and $v$ are the top left and right singular vectors of $M + N$, respectively), we have that
$$\sigma_1(A - B') \le \sigma_1(C - X) + \sigma_1(B - Y).$$
Since the above inequality holds for arbitrary $B'$ with rank at most $k + i - 1$, we can choose $X = C_{i-1}$ and $Y = B_k$, the partial SVDs of $C$ and $B$ of rank $i - 1$ and $k$, respectively. Then we can conclude that
$$\sigma_{k+i}(A) \le \sigma_1(C - C_{i-1}) + \sigma_1(B - B_k) = \sigma_i(A - B) + \sigma_{k+1}(B),$$
completing the proof since $\sigma_{k+1}(B)$ is zero.
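As a final sanity check, the full lemma $\sigma_{k+i}(A) \le \sigma_i(A - B)$ can be tested for all $i$ at once on a random instance (a NumPy sketch with illustrative dimensions; `sA[k + i]` is the $0$-indexed $\sigma_{k+i+1}$, so the loop covers every valid index):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 8, 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))  # rank <= k

sA = np.linalg.svd(A, compute_uv=False)
sC = np.linalg.svd(A - B, compute_uv=False)

# Lemma: sigma_{k+i}(A) <= sigma_i(A - B) for every i
# (0-indexed: sA[k + i] <= sC[i])
for i in range(n - k):
    assert sA[k + i] <= sC[i] + 1e-10
```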