paper-reading

Generative Modeling by Estimating Gradients of the Data Distribution

Paper: https://arxiv.org/abs/1907.05600

background

  1. likelihood-based methods
    1. approach: uses log-likelihood (or a suitable surrogate) as the training objective.
    2. intrinsic limitations: either have to use specialized architectures to build a normalized probability model (e.g., autoregressive models, flow models), or use surrogate losses (e.g., the evidence lower bound used in variational auto-encoders, contrastive divergence in energy-based models) for training.
  2. generative adversarial networks
    1. approach: uses adversarial training to minimize f-divergences or integral probability metrics between model and data distributions.
    2. intrinsic limitations: their training can be unstable due to the adversarial training procedure. In addition, the GAN objective is not suitable for evaluating and comparing different GAN models.

two main challenges with the new approach

  1. if the data distribution is supported on a low dimensional manifold—as it is often assumed for many real world datasets—the score will be undefined in the ambient space, and score matching will fail to provide a consistent score estimator.
  2. the scarcity of training data in low data density regions, e.g., far from the manifold, hinders the accuracy of score estimation and slows down the mixing of Langevin dynamics sampling.

Our sampling strategy is inspired by simulated annealing [30, 37] which heuristically improves optimization for multimodal landscapes.

score-matching

The optimization objective is

$$J_{ESM}(\theta) = \frac{1}{2}\,\mathbb{E}_{p_{data}(\mathbf{x})}\!\left[\left\|\mathbf{s}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log p_{data}(\mathbf{x})\right\|_2^2\right]$$

where the parameterized term $\mathbf{s}_\theta(\mathbf{x})$ is the estimated score and the parameter-free term $\nabla_{\mathbf{x}} \log p_{data}(\mathbf{x})$ is the score of the true data distribution.
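As a toy sketch (not from the paper), the ESM objective can be evaluated directly when the data distribution is simple enough that its true score is known in closed form; the 1-D standard-normal data and the linear score model $s_\theta(x) = \theta x$ below are assumptions made purely for illustration:

```python
import numpy as np

# Toy ESM example (a sketch, not the paper's setup): 1-D standard-normal
# data, whose true score is known: d/dx log p(x) = -x.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

def esm_loss(theta, x):
    """Monte-Carlo estimate of 1/2 E[(s_theta(x) - d/dx log p(x))^2]
    for the (assumed) linear score model s_theta(x) = theta * x."""
    return 0.5 * np.mean((theta * x - (-x)) ** 2)

# ESM is minimized when the model matches the true score, i.e. theta = -1.
thetas = np.linspace(-2.0, 0.0, 201)
best = thetas[np.argmin([esm_loss(t, x) for t in thetas])]
print(best)  # ≈ -1.0
```

In practice the true score is unknown, which is exactly why this explicit objective is intractable and the implicit and denoising variants are needed.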

Explicit Score Matching

Implicit Score Matching

Theorem 1 (Hyvärinen, 2005)

Under suitable regularity conditions, the explicit objective equals, up to a constant independent of $\theta$, the implicit objective

$$J_{ISM}(\theta) = \mathbb{E}_{p_{data}(\mathbf{x})}\!\left[\operatorname{tr}\!\left(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})\right) + \frac{1}{2}\left\|\mathbf{s}_\theta(\mathbf{x})\right\|_2^2\right]$$

which does not involve the unknown true score.

Proof

Expanding the square in $J_{ESM}$ and dropping $\frac{1}{2}\mathbb{E}\|\nabla_{\mathbf{x}}\log p_{data}\|_2^2$, which does not depend on $\theta$, it suffices to show

$$\mathbb{E}_{p_{data}(\mathbf{x})}\!\left[\mathbf{s}_\theta(\mathbf{x})^\top \nabla_{\mathbf{x}} \log p_{data}(\mathbf{x})\right] = -\,\mathbb{E}_{p_{data}(\mathbf{x})}\!\left[\operatorname{tr}\!\left(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})\right)\right].$$

Considering the $i$-th dimension, it suffices to show

$$\int p_{data}(\mathbf{x})\, s_{\theta,i}(\mathbf{x})\, \frac{\partial \log p_{data}(\mathbf{x})}{\partial x_i}\, d\mathbf{x} = -\int p_{data}(\mathbf{x})\, \frac{\partial s_{\theta,i}(\mathbf{x})}{\partial x_i}\, d\mathbf{x},$$

which follows from $p_{data}\,\partial_{x_i}\log p_{data} = \partial_{x_i} p_{data}$ and integration by parts, assuming $p_{data}(\mathbf{x})\, s_{\theta,i}(\mathbf{x}) \to 0$ as $|x_i| \to \infty$. $\blacksquare$

This theorem converts the intractable objective (which requires the unknown true score) into a tractable one, so optimization becomes possible. The remaining problem is that for high-dimensional data, computing the partial derivatives in $\operatorname{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta)$ is too expensive: it needs one derivative per dimension.
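To make that cost concrete, here is a Monte-Carlo ISM sketch under toy assumptions (the analytic score $s(x) = -x$ of a standard normal, derivatives by finite differences); the per-dimension loop is exactly the bottleneck for high-dimensional data:

```python
import numpy as np

# Monte-Carlo ISM estimate (a sketch, not the paper's implementation).
rng = np.random.default_rng(0)
D = 8
x = rng.standard_normal((1000, D))

def score_model(x):
    return -x  # exact score of N(0, I), used here as a stand-in model

def ism_loss(score, x, eps=1e-4):
    """E[ tr(grad_x s(x)) + 1/2 ||s(x)||^2 ], estimated by Monte Carlo.
    The trace costs one extra score evaluation per dimension."""
    s = score(x)
    trace = np.zeros(x.shape[0])
    for i in range(x.shape[1]):                  # O(D) loop: the bottleneck
        dx = np.zeros_like(x)
        dx[:, i] = eps
        trace += (score(x + dx)[:, i] - s[:, i]) / eps
    return np.mean(trace + 0.5 * np.sum(s ** 2, axis=1))

val = ism_loss(score_model, x)
print(val)  # ≈ -D/2 = -4.0 for the true score
```

For images with tens of thousands of dimensions, this per-dimension derivative makes ISM impractical, motivating the denoising variant below.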

Denoising Score Matching

Let $q_\sigma(\tilde{\mathbf{x}}) = \int q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\, p_{data}(\mathbf{x})\, d\mathbf{x}$ denote the distribution of the noise-perturbed images, where $q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})$ is the perturbation kernel, e.g. $\mathcal{N}(\tilde{\mathbf{x}};\, \mathbf{x},\, \sigma^2 I)$. The denoising objective is

$$J_{DSM}(\theta) = \frac{1}{2}\,\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\, p_{data}(\mathbf{x})}\!\left[\left\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\right\|_2^2\right].$$

By using the conditional distribution, whose score is available in closed form, the originally intractable objective becomes tractable.
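A minimal sketch of the DSM objective, assuming the Gaussian perturbation kernel (whose conditional score is $-(\tilde{\mathbf{x}} - \mathbf{x})/\sigma^2$), standard-normal toy data, and the analytically known score of the perturbed marginal for comparison:

```python
import numpy as np

# DSM sketch: Gaussian kernel q_sigma(x~ | x) = N(x~; x, sigma^2 I),
# whose conditional score is closed-form: grad_{x~} log q = -(x~ - x) / sigma^2.
rng = np.random.default_rng(0)
sigma = 0.5
x = rng.standard_normal((10_000, 2))                 # toy clean data
x_tilde = x + sigma * rng.standard_normal(x.shape)   # perturbed data

def dsm_loss(score, x, x_tilde, sigma):
    """1/2 E[ || s(x~) - grad log q_sigma(x~ | x) ||^2 ], Monte-Carlo estimate."""
    target = -(x_tilde - x) / sigma ** 2
    diff = score(x_tilde) - target
    return 0.5 * np.mean(np.sum(diff ** 2, axis=1))

# For standard-normal data the perturbed marginal is N(0, (1 + sigma^2) I);
# its score should minimize the DSM loss.
loss_true = dsm_loss(lambda z: -z / (1.0 + sigma ** 2), x, x_tilde, sigma)
loss_wrong = dsm_loss(lambda z: -z, x, x_tilde, sigma)  # score with the wrong variance
print(loss_true < loss_wrong)  # True: the true perturbed score scores lower
```

Note that the minimizer matches the score of the perturbed marginal, not of the clean data, which is exactly what the theorem below formalizes.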

Theorem 2 (Vincent, 2011)

We have

$$J_{DSM}(\theta) = J_{ESM,q_\sigma}(\theta) + C,$$

where $J_{ESM,q_\sigma}(\theta) = \frac{1}{2}\,\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}})}\!\left[\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}}\log q_\sigma(\tilde{\mathbf{x}})\|_2^2\right]$ and $C$ is a constant independent of $\theta$. The two are therefore the same optimization objective; equivalently,

$$\arg\min_\theta J_{DSM}(\theta) = \arg\min_\theta J_{ESM,q_\sigma}(\theta).$$

This theorem shows that DSM amounts to performing ESM in the space of noise-perturbed images.
Only when the noise magnitude is sufficiently small can the learned gradients be regarded as gradients on the true sample space.

Proof

Both objectives expand into a $\theta$-independent constant, the term $\frac{1}{2}\mathbb{E}\|\mathbf{s}_\theta\|_2^2$, and a cross term, so it suffices to show the cross terms agree. Since

$$\nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}}) = \frac{\nabla_{\tilde{\mathbf{x}}} q_\sigma(\tilde{\mathbf{x}})}{q_\sigma(\tilde{\mathbf{x}})} = \frac{\int p_{data}(\mathbf{x})\, \nabla_{\tilde{\mathbf{x}}} q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\, d\mathbf{x}}{q_\sigma(\tilde{\mathbf{x}})},$$

substituting into the original expression gives

$$\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}})}\!\left[\mathbf{s}_\theta(\tilde{\mathbf{x}})^\top \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}})\right] = \mathbb{E}_{q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\, p_{data}(\mathbf{x})}\!\left[\mathbf{s}_\theta(\tilde{\mathbf{x}})^\top \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\right],$$

so the two objectives differ only by a constant. $\blacksquare$

Sampling Algorithm

Benefits of adding noise:

  1. since the support of our Gaussian noise distribution is the whole space, the perturbed data will not be confined to a low dimensional manifold, which obviates difficulties from the manifold hypothesis and makes score estimation well-defined.
  2. large Gaussian noise has the effect of filling low density regions in the original unperturbed data distribution; therefore score matching may get more training signal to improve score estimation.
  3. by using multiple noise levels we can obtain a sequence of noise-perturbed distributions that converge to the true data distribution. We can improve the mixing rate of Langevin dynamics on multimodal distributions by leveraging these intermediate distributions in the spirit of simulated annealing and annealed importance sampling.

Loss Function

The weighting $\lambda(\sigma)$ is chosen so that the losses at different noise magnitudes are on the same scale. The combined objective is

$$\mathcal{L}(\theta; \{\sigma_i\}_{i=1}^L) = \frac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_i)\, \ell(\theta; \sigma_i), \qquad \ell(\theta; \sigma) = \frac{1}{2}\,\mathbb{E}_{p_{data}(\mathbf{x})}\,\mathbb{E}_{\tilde{\mathbf{x}} \sim \mathcal{N}(\mathbf{x},\, \sigma^2 I)}\!\left[\left\|\mathbf{s}_\theta(\tilde{\mathbf{x}}, \sigma) + \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^2}\right\|_2^2\right].$$

Because, empirically, a trained network satisfies

$$\left\|\mathbf{s}_\theta(\mathbf{x}, \sigma)\right\|_2 \propto \frac{1}{\sigma},$$

we set

$$\lambda(\sigma) = \sigma^2,$$

so that $\lambda(\sigma)\,\ell(\theta;\sigma) = \frac{1}{2}\mathbb{E}\!\left[\left\|\sigma\, \mathbf{s}_\theta(\tilde{\mathbf{x}}, \sigma) + \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma}\right\|_2^2\right]$, where both $\sigma\,\mathbf{s}_\theta$ and $\frac{\tilde{\mathbf{x}}-\mathbf{x}}{\sigma} \sim \mathcal{N}(0, I)$ are of order one, independent of $\sigma$.
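A small numerical check of this weighting, under the toy assumption of standard-normal data (so the score of each perturbed marginal, $-\tilde{x}/(1+\sigma^2)$, is exact):

```python
import numpy as np

# Numerical check of lambda(sigma) = sigma^2 (toy assumption: standard-normal
# data, so the score of the perturbed marginal is known in closed form).
rng = np.random.default_rng(0)

def per_sigma_loss(sigma, n=200_000):
    """l(theta; sigma) = 1/2 E|| s(x~, sigma) + (x~ - x)/sigma^2 ||^2."""
    x = rng.standard_normal(n)
    x_tilde = x + sigma * rng.standard_normal(n)
    s = -x_tilde / (1.0 + sigma ** 2)            # exact perturbed-marginal score
    return 0.5 * np.mean((s + (x_tilde - x) / sigma ** 2) ** 2)

raws, weighteds = [], []
for sigma in (0.01, 0.1, 1.0):
    raw = per_sigma_loss(sigma)
    raws.append(raw)
    weighteds.append(sigma ** 2 * raw)           # lambda(sigma) = sigma^2
    print(f"sigma={sigma:5.2f}  raw={raw:10.3f}  weighted={sigma ** 2 * raw:.3f}")
# raw varies by orders of magnitude; the weighted losses stay on one scale.
```

The unweighted losses blow up as $\sigma \to 0$ (like $1/\sigma^2$), while the $\sigma^2$-weighted values stay comparable, which is the point of the weighting.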

Inference Algorithm: Annealed Langevin Dynamics

The step size $\alpha_i = \epsilon\, \sigma_i^2 / \sigma_L^2$ is chosen to keep the signal-to-noise ratio constant across noise levels.
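A runnable sketch of annealed Langevin dynamics, assuming 1-D standard-normal data so the score of every perturbed marginal is analytic; the schedule, $\epsilon$, and $T$ below are illustrative choices, not the paper's hyperparameters:

```python
import numpy as np

# Annealed Langevin dynamics (a sketch, not the paper's implementation).
# Toy assumption: 1-D standard-normal data, so the score of each perturbed
# marginal N(0, 1 + sigma^2) is analytic.
rng = np.random.default_rng(0)

def score(x, sigma):
    return -x / (1.0 + sigma ** 2)

def annealed_langevin(n_samples=5000, eps=1e-4, T=100):
    sigmas = np.geomspace(1.0, 0.01, 10)             # decreasing noise levels
    x = rng.uniform(-8.0, 8.0, n_samples)            # arbitrary initialization
    for sigma in sigmas:
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2   # step size keeps SNR constant
        for _ in range(T):
            z = rng.standard_normal(n_samples)
            x = x + 0.5 * alpha * score(x, sigma) + np.sqrt(alpha) * z
    return x

samples = annealed_langevin()
print(samples.mean(), samples.std())  # ≈ 0 and ≈ 1 for this toy target
```

Early, large-$\sigma$ levels take big steps that move samples from the arbitrary initialization into high-density regions; later, small-$\sigma$ levels refine them toward the (nearly) unperturbed distribution.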