Let $R$ be the loss functional defined on the space probability measures $P(\mathbb{R}^n)$:
\[R(\rho) = \sum_j \Big( \int \sigma(x_j, \theta) d \rho(\theta) - y_j \Big)^2\]where $\theta \in \mathbb{R}^n$ represents the layer-wise parameter and $\sigma$ is the activation function. $\rho$ describes a partical system.
With direct computation, we have that
\[\frac{\partial R}{\partial \rho}(\rho)(\eta) = \int \sum_j (\int \sigma\left(x_j, \theta^{\prime}\right) d \rho\left(\theta^{\prime}\right)-y_j) \cdot \sigma\left(x_j, \theta\right) d \eta(\theta) \triangleq \int \Psi(\rho, \theta) d \eta(\theta).\]Interpreting $\Psi_\rho = \Psi(\rho, \cdot) $.
Fix a system $\rho$. Consider moving a partical with mass $1$ from $\theta$ to $\theta+\Delta \theta$ (without changing the total mass). Loosely, the first-order change on the loss $R$ is:
\[\Delta R \approx \frac{\partial R}{\partial \rho}(\rho)(- \delta_\theta + \delta_{\theta+\Delta \theta}) = \Psi_\rho(\theta+\Delta \theta) - \Psi_\rho(\theta) \approx \langle \nabla_\theta \Psi_\rho(\theta) , \Delta \theta \rangle.\]Therefore, at a given position $\theta$, the fastest way to decrease $R$ is to move the partical along the direction $-\nabla_\theta \Psi_\rho(\theta)$.
Recall that the continuity equation is
\[\partial_t \rho_t = - \nabla \cdot (\rho_t \cdot v)\]where $\rho_t \cdot v$ is called the flux and $v$ is the velocity field that describes the flux. Putting everything together, we have the meanfield PDE:
\[\partial_t \rho_t = \nabla_\theta \cdot (\rho_t \cdot \nabla_\theta \Psi(\rho_t, \theta)).\]