Consider a Gaussian model $Y \mid \theta \sim N( f(\theta), I)$. The negative log-likelihood is then
\[\ell(\theta) = -\log p(Y \mid \theta) = \frac{1}{2} \| f(\theta) - Y \|^2 + \text{const},\]which is precisely the MSE loss.
Let $J \triangleq \partial_\theta f(\theta)$ denote the Jacobian of $f$. Then $\nabla_\theta \ell = J^\top (f(\theta) - Y) \triangleq J^\top e$, where the residual $e = f(\theta)-Y \sim N(0, I)$.
The Fisher information matrix is
\[F = \mathbb{E}_Y [ (\nabla_\theta \ell) (\nabla_\theta \ell)^\top ] = \mathbb{E}_Y[ J^\top e e^\top J ]= J^\top \mathbb{E}_Y[ e e^\top ] J = J^\top J.\]Thus, for the MSE loss under this Gaussian model, the NGD update is
\[\theta^+ = \theta - \eta (J^\top J)^{-1} \nabla_\theta \ell = \theta - \eta (J^\top J)^{-1} J^\top e.\]Note that $J$ has shape $n_{y} \times n_{\text{param}}$, so in the overparametrized regime ($n_{\text{param}} > n_y$), $J^\top J$ is singular. One can use the pseudoinverse $(J^\top J)^\dagger$ instead.
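As a sanity check, both the identity $F = J^\top J$ and the pseudoinverse NGD update can be verified numerically. The sketch below (the toy linear model $f(\theta) = A\theta$ is an assumption chosen for illustration) estimates the Fisher by Monte Carlo and applies one update with $\eta = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparametrized model f(theta) = A @ theta (an illustrative choice),
# so the Jacobian J = A is constant, with shape (n_y, n_param).
n_y, n_param = 3, 5
A = rng.standard_normal((n_y, n_param))
theta = rng.standard_normal(n_param)

def f(th):
    return A @ th

J = A  # Jacobian of f at theta

# Monte Carlo estimate of F = E[(J^T e)(J^T e)^T] with residuals e ~ N(0, I)
n_samples = 200_000
E = rng.standard_normal((n_samples, n_y))   # each row is one residual sample e
scores = E @ J                              # each row is the score (J^T e)^T
F_mc = scores.T @ scores / n_samples
print(np.allclose(F_mc, J.T @ J, atol=0.1))  # should match J^T J up to MC noise

# One NGD step with the pseudoinverse (J^T J is singular since n_param > n_y)
Y = rng.standard_normal(n_y)
e = f(theta) - Y
eta = 1.0
theta_new = theta - eta * np.linalg.pinv(J.T @ J) @ J.T @ e
print(np.allclose(f(theta_new), Y))          # residual eliminated in one step
```

For linear $f$ and $\eta = 1$ the update reaches a minimizer in a single step, since $J J^\dagger$ is the identity when $J$ has full row rank.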
For any real matrix $J$, $(J^\top J)^{\dagger} J^\top = J^\dagger$, so the NGD update becomes
\[\theta^+ = \theta - \eta J^\dagger e.\]Now let us check the linear stability of NGD at a minimizer $\theta^*$, i.e., a point with $f(\theta^*)=Y$.
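(As an aside, the pseudoinverse identity $(J^\top J)^\dagger J^\top = J^\dagger$ used above can itself be checked numerically; a minimal sketch with a random rank-deficient matrix:)

```python
import numpy as np

rng = np.random.default_rng(1)

# Random rank-deficient J of shape (n_y, n_param), with rank r = 2 < min(3, 5)
n_y, n_param, r = 3, 5, 2
J = rng.standard_normal((n_y, r)) @ rng.standard_normal((r, n_param))

lhs = np.linalg.pinv(J.T @ J) @ J.T  # (J^T J)^† J^T
rhs = np.linalg.pinv(J)              # J^†
print(np.allclose(lhs, rhs))
```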
The Jacobian of the NGD update map is
\[\partial_\theta (\theta^+) = I - \eta \partial_\theta( J^\dagger e ) = I - \eta \partial_\theta( J^\dagger )\cdot e - \eta J^\dagger \cdot \partial_\theta e.\]Evaluated at a minimizer $\theta^*$, we have $e^*=0$ and $\partial_\theta e = J$, so
\[\partial_\theta (\theta^+)\mid_{\theta^*} = I - \eta J^\dagger J.\]Note that $J^\dagger J$ is the orthogonal projection onto the row space of $J$: if the SVD of $J$ is $U\Sigma V^\top$, then $J^\dagger J = V \,\text{Diag}(I_r, 0)\, V^\top$, where $r$ is the rank of $J$.
Consequently, the eigenvalues of $\partial_\theta (\theta^+)\mid_{\theta^*}$ are $1$ (with multiplicity $n_{\text{param}} - r$) and $1-\eta$ (with multiplicity $r$), independent of the location $\theta^*$.
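A quick numerical check of this spectrum (assuming a random, hence generically full-row-rank, Jacobian for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Jacobian of the NGD update map at a minimizer: I - eta * J^† J
n_y, n_param = 3, 5
J = rng.standard_normal((n_y, n_param))  # rank r = 3 for a generic draw
eta = 0.3
M = np.eye(n_param) - eta * np.linalg.pinv(J) @ J

# M is symmetric (J^† J is an orthogonal projection); eigvalsh sorts ascending.
eigvals = np.linalg.eigvalsh(M)
print(np.allclose(eigvals, [1 - eta] * 3 + [1.0] * 2))  # 1-eta (x r), 1 (x n_param-r)
```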
This means that, in the eyes of NGD, all minimizers have the same sharpness.
Remark: instead of $(J^\top J)^\dagger$ one can also use the damped inverse $(J^\top J + \varepsilon I)^{-1}$, in which case the statements above can fail.
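Concretely, with the damped inverse the update-map Jacobian at a minimizer becomes $I - \eta (J^\top J + \varepsilon I)^{-1} J^\top J$, whose nontrivial eigenvalues are $1 - \eta \sigma_i^2/(\sigma_i^2 + \varepsilon)$ for the singular values $\sigma_i$ of $J$ at $\theta^*$, so different minimizers can have different sharpness. A small numerical sketch of this claim:

```python
import numpy as np

rng = np.random.default_rng(3)

# Damped update-map Jacobian at a minimizer:
#   I - eta (J^T J + eps I)^{-1} J^T J,
# with eigenvalues 1 - eta * s_i^2 / (s_i^2 + eps) for singular values s_i of J
# (and 1 on the null space of J) -- these now depend on theta* through J.
n_y, n_param = 3, 5
J = rng.standard_normal((n_y, n_param))
eta, eps = 0.5, 0.1

G = J.T @ J
M = np.eye(n_param) - eta * np.linalg.solve(G + eps * np.eye(n_param), G)

s = np.linalg.svd(J, compute_uv=False)
predicted = np.concatenate([1 - eta * s**2 / (s**2 + eps), np.ones(n_param - n_y)])
print(np.allclose(np.sort(np.linalg.eigvals(M).real), np.sort(predicted)))
```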