Gram vs Covariance & PCA

Consider a data matrix $X\in\mathbb{R}^{n\times d}$ with $n$ data points and $d$ features. Let $x_i$ denote a row vector / a data point.

The (data) Gram matrix $XX^T \in \mathbb{R}^{n\times n}$ captures the Euclidean geometry (lengths and angles) of the $n$ data points in the $d$ dimensional data space.

Now let’s consider the matrix $X^TX \in \mathbb{R}^{d\times d}$. We can think of features (columns) as functions, defined on the data space. In this sense, the matrix $X^TX$ is the Gram matrix of the discretized feature functions.

This matrix is closely related to the sample second moment matrix, which is $\frac{1}{n}X^TX$, and the sample covariance matrix, which is $\frac{1}{n-1}X^TX$ if the data is centered ($\sum_i x_i=0\in\mathbb{R}^d$).

To explain the spectra of these two matrices, consider the SVD $X=U\Sigma V^T$. The right singular vectors $V=(v_i)\in O(d)$ are the eigenvectors of the feature Gram. Each $v_i$ is a principal direction in the data space. The left singular vectors $U=(u_i)\in O(n)$ are the eigenvectors of the data Gram. Each $u_i$ can be viewed as a normalized “eigenfeature”;
Then, $Xv_i = \sigma_i u_i$ means that, projecting the data matrix onto the principal direction $v_i$ yields the eigenfeature (or score in the PCA language) $\sigma_i u_i$.

So, PCA, in one word, is $XV=U\Sigma$.

Categories: Learning

Next: Repelling directions -> measure-zero attraction basin?

Previous: Eigendecomposition vs SVD