
Matryoshka Diffusion Models

Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind & Navdeep Jaitly, ICLR 2024

Summary

Matryoshka Diffusion Models (MDM) are an end-to-end diffusion framework for generating high-resolution images and videos. MDM breaks away from the usual cascaded and latent diffusion pipelines: it operates directly in pixel space rather than in a latent space, which makes conditioning (e.g., on text representations) more straightforward. This matters for tasks that need high-resolution generation without the added complexity of multi-stage training or inference.

Main Contribution

Method

\[\begin{equation} q(z_t^r|x)=\mathcal{N}(z_t^r;\alpha_t^rD^r(x), {\sigma_t^r}^2I), \end{equation}\]
where $D^r: \mathbb{R}^N\rightarrow\mathbb{R}^{N_r}$ is a deterministic "down-sample" operator depending on the data: $D^r(x)$ is a coarse, lossy-compressed version of $x$. For instance, $D^r(\cdot)$ can be $\texttt{avgpool}(\cdot)$ for generating low-resolution images, and $\{\alpha^r_t, \sigma^r_t\}$ is the resolution-specific noise schedule; the authors shift the noise schedule based on the input resolution. MDM then learns the backward process $p_\theta(z_{t-1}|z_t)$ with $R$ neural denoisers $x_\theta^r(z_t)$. Each variable $z^r_{t-1}$ depends on all resolutions $\{z_t^1, \dots, z_t^R\}$ at time step $t$. During inference, MDM generates all $R$ resolutions in parallel; there is no dependency between the $z^r_t$.
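The forward corruption above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the downsampling factors and the per-resolution schedule values `alphas`/`sigmas` below are hypothetical placeholders standing in for $D^r$ and $\{\alpha^r_t, \sigma^r_t\}$ at one timestep $t$.

```python
import numpy as np

def avgpool(x, factor):
    """D^r(x): average-pool an (H, W, C) image by an integer factor."""
    h, w, c = x.shape
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def forward_corrupt(x, alphas, sigmas, factors, rng):
    """Sample z_t^r ~ N(alpha_t^r * D^r(x), (sigma_t^r)^2 I) for each resolution r."""
    zs = []
    for alpha, sigma, f in zip(alphas, sigmas, factors):
        dx = avgpool(x, f)  # coarse, lossy-compressed version of x
        zs.append(alpha * dx + sigma * rng.standard_normal(dx.shape))
    return zs

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64, 3))  # toy "image"
# Hypothetical schedules at one timestep t; lower resolutions get a
# shifted (noisier) schedule, mirroring the resolution-dependent shift in the paper.
zs = forward_corrupt(x, alphas=[0.9, 0.7, 0.5], sigmas=[0.1, 0.3, 0.5],
                     factors=[1, 2, 4], rng=rng)
print([z.shape for z in zs])  # [(64, 64, 3), (32, 32, 3), (16, 16, 3)]
```

Note that each $z_t^r$ is drawn independently given $x$, which is what allows the reverse process to denoise all $R$ resolutions in parallel.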

Results

MDM outperforms the alternative approaches, delivering higher sample quality and converging faster during training.

Figure: results plot

Samples from the model trained on CC12M at 1024 resolution


Two Cents