Problem 12: Deterministic $CUR$-Type Decompositions
| Suggested by | Michael Mahoney |
| --- | --- |
| Source | Kanpur 2006 |
| Short link | https://sublinear.info/12 |
A $CUR$-decomposition of $A$ expresses $A$ as a product of three matrices, $C$, $U$, and $R$, where $C$ consists of a small number of actual columns of $A$, $R$ consists of a small number of actual rows of $A$, and $U$ is a small, carefully constructed matrix that guarantees that the product $CUR$ is “close” to $A$. Recent work [DrineasMM-06,DrineasMM-06a,DrineasMM-06b] proved the existence of, and provided efficient randomized algorithms for, $CUR$ decompositions that are nearly as good as the best rank-$k$ approximation to $A$ obtained by truncating the SVD. Hence, the columns of $A$ included in $C$, as well as the rows of $A$ included in $R$, can be used in place of the eigencolumns and eigenrows, with the added benefit of improved interpretability in terms of the original data. Note the structural simplicity of a $CUR$ matrix decomposition:
\begin{equation*}
\underbrace{\left( \begin{array}{ccccc} &&&& \\ &&&& \\ &&A&& \\ &&&& \\ &&&& \end{array} \right)}_{m \times n}
\approx
\underbrace{\left( \begin{array}{ccc} && \\ && \\ &C& \\ && \\ && \end{array} \right)}_{m \times c}
\underbrace{\left( \begin{array}{ccc} && \\ &U& \\ && \end{array} \right)}_{c \times r}
\underbrace{\left( \begin{array}{ccccc} &&&& \\ &&R&& \\ &&&& \end{array} \right)}_{r \times n} .
\end{equation*}
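To make the structure above concrete, here is a minimal numpy sketch of how $C$, $U$, and $R$ are assembled once row and column index sets have been chosen. The index sets below are arbitrary placeholders, and the plain pseudoinverse of their intersection is only a stand-in for the carefully constructed $U$ of the results discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 80))   # an m x n matrix, m = 100, n = 80

# Hypothetical index sets; a real CUR algorithm chooses these carefully.
J = [3, 17, 41, 65]                  # c = 4 column indices
I = [5, 22, 48, 71, 90]              # r = 5 row indices

C = A[:, J]                          # m x c: actual columns of A
R = A[I, :]                          # r x n: actual rows of A
W = A[np.ix_(I, J)]                  # r x c: intersection of the chosen rows and columns
U = np.linalg.pinv(W)                # c x r: simple stand-in for the constructed U

A_cur = C @ U @ R                    # m x n matrix, the CUR approximation to A
print(np.linalg.norm(A - A_cur, 'fro'))
```

(For a random dense matrix this is of course a poor approximation; the point is only the shapes involved and the fact that $C$ and $R$ are genuine columns and rows of $A$.)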
We briefly expand on the latter point. In many cases, an important step in data analysis is to construct a compressed representation of $A$ that may be easier to analyze and interpret. The most common such representation is obtained by truncating the SVD at some number $k \ll \min \{m,n\}$ of terms, in large part because this provides the “best” rank-$k$ approximation to $A$ when measured with respect to any unitarily invariant matrix norm. Unfortunately, the basis vectors (the so-called eigencolumns and eigenrows) provided by this approximation (and with respect to which every column and row of the original data matrix is expressed) are notoriously difficult to interpret in terms of the underlying data and the processes generating that data. Gould, in “The Mismeasure of Man” [Gould-96], provides examples where such reification of the singular vectors (or principal components or “factors”) resulted in social policy with potentially devastating consequences for large groups. For example, the vector $[(1/2)\,\mbox{age} - (1/\sqrt{2})\,\mbox{height} + (1/2)\,\mbox{income}]$, being one of the significant uncorrelated “factors” from a dataset of people's features, is not particularly informative. From an analyst's point of view, it would be highly preferable to have a low-rank approximation that is nearly as good as that provided by the SVD but that is expressed in terms of a small number of actual columns and/or actual rows of a matrix, rather than linear combinations of those columns and rows. Our $CUR$ matrix decomposition is a direct formulation of this problem.

For example, the $CUR$ matrix decomposition was recently applied to hyperspectrally-resolved medical imaging data [MahoneyMD-06]. In this application, a column corresponds to an image at a single physical frequency and a row corresponds to a single spectrally-resolved pixel; it was shown that data reconstruction and classification tasks can be performed with little loss in quality even after substantial data compression.
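For reference, the truncated-SVD baseline that $CUR$ is compared against can be written as
\begin{equation*}
A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^{T},
\end{equation*}
where $\sigma_1 \geq \cdots \geq \sigma_k$ are the largest singular values of $A$ and the $u_i$ and $v_i$ are the corresponding left and right singular vectors (the eigencolumns and eigenrows referred to above). Every column of $A_k$ is a linear combination of all of the $u_i$, which is precisely what makes this representation difficult to interpret in terms of the original data.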
The main existing result for $CUR$ matrix decompositions is the following.
- Theorem (Drineas et al. [DrineasMM-06b]). Given a matrix $A \in \mathbb{R}^{m \times n}$ and an integer $k \ll \min \{m,n\}$, there exist randomized algorithms such that if $c = O(\epsilon^{-2} k \log k \log (1/\delta))$ columns of $A$ (in expectation) are chosen to construct $C$, and then $r = O(\epsilon^{-2}c \log c \log (1/\delta))$ rows of $A$ (in expectation) are chosen to construct $R$, then with probability at least $1-\delta$,
- \begin{equation*}\newcommand{\FNorm}[1]{\|#1\|_{\rm F}}\FNorm{A - CUR} \leq (1+\epsilon) \FNorm{A-A_k}.\end{equation*}
- Here, the matrix $U$ is a weighted Moore-Penrose inverse of the intersection between $C$ and $R$, and $A_k$ is the best rank-$k$ approximation to $A$. The randomized algorithm runs in time $O(\operatorname{SVD}(A_k))$, the time required to compute the best rank-$k$ approximation to $A$ via the SVD [GolubV-89].
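The following numpy sketch illustrates the subspace-sampling idea behind the theorem, assuming fixed sample sizes rather than expected ones and a plain (unweighted) pseudoinverse in place of the weighted Moore-Penrose inverse; it is a simplified illustration, not the algorithm of [DrineasMM-06b], so the $(1+\epsilon)$ bound is not claimed for it.

```python
import numpy as np

def leverage_scores(M, k):
    """Rank-k leverage scores of the columns of M: squared column norms
    of the matrix formed by the top-k right singular vectors."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return np.sum(Vt[:k, :] ** 2, axis=0)

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 30)) @ rng.standard_normal((30, 150))  # low-rank test matrix
k, c, r = 10, 40, 60                  # target rank and fixed sample sizes

# Columns of C: sampled with probability proportional to the rank-k
# leverage scores of the columns of A.
p_col = leverage_scores(A, k)
cols = rng.choice(A.shape[1], size=c, replace=True, p=p_col / p_col.sum())
C = A[:, cols]

# Rows of R: sampled with probability proportional to the leverage scores
# of the rows of A with respect to the column span of C.
p_row = leverage_scores(C.T, np.linalg.matrix_rank(C))
rows = rng.choice(A.shape[0], size=r, replace=True, p=p_row / p_row.sum())
R = A[rows, :]

# U: pseudoinverse of the intersection of the chosen rows and columns
# (an unweighted stand-in for the weighted Moore-Penrose inverse above).
U = np.linalg.pinv(A[np.ix_(rows, cols)])
print(np.linalg.norm(A - C @ U @ R, 'fro'))
```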
Many important questions remain open within the context of $CUR$-type decompositions. The most important one is to devise deterministic algorithms. Although the randomized algorithms are satisfactory from a theoretical viewpoint, deterministic algorithms would be much preferable. Results of Gu and Eisenstat [GuE-96] and Stewart [Stewart-99,Stewart-04] may help towards this goal. Also relevant is work by Goreinov, Tyrtyshnikov, and Zamarashkin [GoreinovTZ-97,GoreinovT-01] that was motivated by applications such as scattering, in which large coefficient matrices have blocks that can be easily approximated by low-rank matrices. They showed that if the matrix $A$ is approximated by a rank-$k$ matrix to within an accuracy $\epsilon$, then there exists a choice of $k$ columns and $k$ rows, i.e., $C$ and $R$, and a $k \times k$ matrix $U$ constructed from the elements of $C$ and $R$, such that $A \approx CUR$ in the sense that $\newcommand{\TNorm}[1]{\|#1\|_2}\TNorm{A-CUR} \le \epsilon f(m,n,k)$, where $ f(m,n,k) = 1 + 2\sqrt{km} + 2\sqrt{kn} $. In [GoreinovTZ-97], the choice of these matrices is related to the problem of determining the minimum singular value $\sigma_k$ of $k \times k$ submatrices of $n \times k$ orthogonal matrices. In addition, in [GoreinovT-01] the choice of $C$ and $R$ is interpreted in terms of the maximum-volume concept from interpolation theory, in the sense that the columns and rows should be chosen so that their intersection $W$ defines a parallelepiped of maximum volume among all $k \times k$ submatrices of $A$.
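As one possible starting point for the deterministic question, the sketch below selects columns (and, applied to $A^T$, rows) by column-pivoted QR, a classical rank-revealing heuristic in the spirit of [GuE-96], and builds $U$ from the $k \times k$ intersection as in [GoreinovTZ-97]. This is only a heuristic illustration; it is not known to achieve the relative-error guarantee of the randomized theorem above.

```python
import numpy as np
from scipy.linalg import qr

def pivoted_qr_select(M, k):
    """Deterministically pick k column indices of M using column-pivoted QR,
    a classical rank-revealing heuristic (not a guaranteed CUR algorithm)."""
    _, _, piv = qr(M, mode='economic', pivoting=True)
    return piv[:k]

rng = np.random.default_rng(0)
A = (rng.standard_normal((120, 20)) @ rng.standard_normal((20, 100))
     + 0.01 * rng.standard_normal((120, 100)))   # nearly rank-20 test matrix
k = 15

cols = pivoted_qr_select(A, k)       # k columns for C
rows = pivoted_qr_select(A.T, k)     # k rows for R, via pivoted QR on A^T

C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(A[np.ix_(rows, cols)])   # k x k intersection, cf. [GoreinovTZ-97]
print(np.linalg.norm(A - C @ U @ R, 2))     # spectral-norm error, cf. the bound above
```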
A second research topic is to improve the error bounds of previous results and to reduce the dependence of the number of sampled columns and rows on $k$ and $\epsilon$. Again, the aforementioned results from the numerical linear algebra community will serve as starting points.