Diffusion Forcing combines strengths of language and image models for better video generation



Researchers have developed a new method called “Diffusion Forcing” that combines the strengths of autoregressive models and diffusion models. The technique enables, among other things, more stable video generation and more flexible planning for robotics tasks.

Scientists from MIT CSAIL and the Technical University of Munich have introduced a new method they call “Diffusion Forcing.” In this approach, the model learns to denoise a sequence of tokens or observations, with each token having its own independent noise level. In this way, the method combines the advantages of autoregressive models, which today power large language models like GPT-4, with those of diffusion models, which have proven successful in image generation, such as in Stable Diffusion.

In next-token prediction, each token is predicted from the tokens that precede it, with everything after it effectively masked. In full-sequence diffusion, the entire sequence is noised gradually, with every token at the same noise level.

Diffusion Forcing combines both approaches: Each token, such as each word of a text or each frame of a video, can have its own noise level between 0 (unchanged) and K (pure noise). This way, a sequence can be partially masked. The model thus learns to reconstruct arbitrary subsets of the observed sequences.
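The training idea can be sketched in a few lines: draw an independent noise level for each token, then noise each token to its own level. This is a minimal illustration, not the paper's implementation; the constant `K`, the cosine-style schedule `alpha_bar`, and the function names are assumptions made here for clarity.

```python
import math
import random

K = 10  # assumed number of discrete noise levels (0 = clean, K = pure noise)

def alpha_bar(k: int) -> float:
    # assumed cosine-style schedule: level 0 keeps the signal, level K destroys it
    return math.cos(0.5 * math.pi * k / K) ** 2

def noise_sequence(tokens, levels, rng=None):
    """Noise each token to its own independent level; a sketch of the
    Diffusion Forcing training signal, not the paper's exact schedule."""
    rng = rng or random.Random(0)
    noisy = []
    for x, k in zip(tokens, levels):
        a = alpha_bar(k)
        noisy.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0))
    return noisy

# during training, every token draws its own level uniformly at random
rng = random.Random(42)
levels = [rng.randint(0, K) for _ in range(4)]
```

A model trained this way sees sequences where some tokens are nearly clean and others are nearly pure noise, which is what lets it reconstruct arbitrary subsets at sampling time.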


During sampling, the model can proceed token by token as in autoregression, or denoise entire sequences at once, depending on the use case. Clever choices of noise levels can also model uncertainty about the future: tokens in the near future receive less noise than distant ones.
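A schedule that keeps the near future sharp and the distant future noisy could look like the following sketch. The linear ramp and the name `pyramid_schedule` are assumptions chosen here to illustrate the idea, not the paper's exact schedule.

```python
def pyramid_schedule(horizon: int, k_max: int = 10) -> list[int]:
    """Assign low noise to the near future and high noise to the distant
    future: a simple linear ramp over the planning horizon."""
    if horizon == 1:
        return [0]
    return [round(k_max * t / (horizon - 1)) for t in range(horizon)]
```

For a horizon of five tokens, the first token is fully resolved (level 0) and the last is pure noise (level `k_max`), with the levels increasing monotonically in between.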

Diffusion Forcing generates temporally stable videos and controls robots

The researchers evaluated their method in applications such as video generation, time-series prediction, and robot control. In many cases, Diffusion Forcing delivered better results than previous methods.

In video generation, for example, conventional autoregressive models can often only provide plausible results for short periods of time. Diffusion Forcing remains stable even for longer sequences.

Video prediction using diffusion forcing and baselines on the Minecraft dataset (0.5x speed). Teacher forcing can easily fail, while diffusion models suffer from serious consistency issues. Stable and consistent video prediction can be achieved with Diffusion Forcing. | Video: Chen et al.

In reinforcement learning scenarios, the model can also plan action sequences of different lengths, depending on the requirements of the current situation. Similar to diffusion models for images, the method can also be used to guide the generation towards specific goals.
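Goal guidance during sampling can be pictured as nudging a partially denoised plan toward a target at each step, analogous to guidance in image diffusion. The quadratic "pull toward goal" used below is a hypothetical stand-in; the paper's guidance uses model-based gradients.

```python
def guided_denoise_step(plan, goal, guidance_scale=0.1):
    """One illustrative guidance step: shift each 2-D waypoint of a
    partially denoised plan a small fraction of the way toward the goal.
    (Hypothetical reward gradient, not the paper's guidance term.)"""
    gx, gy = goal
    return [
        (x + guidance_scale * (gx - x), y + guidance_scale * (gy - y))
        for x, y in plan
    ]

plan = [(0.0, 0.0), (1.0, 1.0)]
goal = (2.0, 2.0)
nudged = guided_denoise_step(plan, goal)
```

Repeating such a step inside the denoising loop biases the final plan toward trajectories that end near the goal, while the diffusion model keeps the plan physically plausible.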


Visualization of the diffusion forcing planning process using a simple maze as an example. To model the causal uncertainty of the future, the diffusion plan can have a near future with a lower noise level and a distant future with a higher noise level – visualized here by the color. | Video: Chen et al.

The method can treat incoming observations as noisy to be robust against distractions. In the video above, the team shows how a robotic arm controlled by Diffusion Forcing continues its task despite the visual disturbance of a shopping bag randomly thrown into the workspace. | Video: Chen et al.

The researchers now want to improve the method further and apply it to larger datasets. Most of the experiments used a small RNN model; larger datasets or high-resolution video will require large transformer models, and initial experiments with transformers are already underway. If the method scales well, Diffusion Forcing could soon take over many tasks and deliver more robust results.
