
OmniGen 2 blends image and text generation like GPT-4o, but is open source


Researchers at the Beijing Academy of Artificial Intelligence have released OmniGen 2, an open-source system for text-to-image generation, image editing, and contextual image creation.

Unlike the original OmniGen, which launched in November 2024, OmniGen 2 uses two distinct decoding paths: one for text and one for images, each with separate parameters and a decoupled image tokenizer. According to the team, this setup allows the model to build on existing multimodal language models without sacrificing their core text generation skills.
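
To make the separation concrete, here is a minimal PyTorch sketch of the idea, not the released code: two decoder stacks with fully separate parameters, where the image path only consumes hidden states from the text path. All class names and dimensions are placeholders.

```python
# Minimal sketch of decoupled text/image decoding (placeholder names and sizes).
import torch.nn as nn

class DualPathModel(nn.Module):
    def __init__(self, d_model=1024, vocab_size=32000, latent_dim=16):
        super().__init__()
        # Text path: stands in for the autoregressive MLLM.
        self.text_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Image path: stands in for the diffusion transformer. It has its own
        # parameters and only reads the text path's hidden states, so the
        # language model's weights (and text skills) stay untouched.
        self.image_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.to_latent = nn.Linear(d_model, latent_dim)

    def forward(self, token_embeddings, generate_image=False):
        hidden = self.text_decoder(token_embeddings)
        if generate_image:
            return self.to_latent(self.image_decoder(hidden))
        return self.lm_head(hidden)
```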

Collage of surreal motifs: space staircase, fantasy heroes, nature panoramas, glowing objects, and portraits.
OmniGen 2 handles a range of prompts and artistic styles, though its photorealistic images still appear a bit blurry. | Image: Wu et al.

The backbone is a multimodal large language model (MLLM) based on the Qwen2.5-VL-3B transformer. For image generation, OmniGen 2 uses a custom diffusion transformer with about four billion parameters. The model switches from writing text to generating images when it encounters a special “<|img|>” token.
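
As a rough illustration of how such a handoff can work, the sketch below watches the decoded stream for the special token and only then invokes an image decoder. The helper callables (next_token_fn, diffusion_fn) are hypothetical stand-ins, not OmniGen 2's API.

```python
# Hypothetical decoding loop: text until "<|img|>", then an image is rendered.
IMG_TOKEN = "<|img|>"

def generate(prompt_tokens, next_token_fn, diffusion_fn, max_new_tokens=256):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)   # autoregressive step of the MLLM backbone
        tokens.append(token)
        if token == IMG_TOKEN:
            # The hidden states produced so far condition the roughly
            # four-billion-parameter diffusion transformer.
            return tokens, diffusion_fn(tokens)
    return tokens, None                 # text-only answer, no image requested
```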

Architecture diagram: Auto-Regressive Transformer processes text and image tokens, providing hidden states for a diffusion transformer with VAE and refiner modules.
OmniGen 2 uses separate decoding paths: an autoregressive transformer for text and a diffusion transformer for images. This helps maintain language skills while producing high-quality visuals. | Image: Wu et al.

Training used roughly 140 million images from open-source datasets as well as proprietary collections. The researchers also developed new techniques that leverage video: they extract similar frames – for example, a face with and without a smile – and use a language model to write the corresponding editing instructions.
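
The snippet below illustrates that kind of video-derived editing data, assuming OpenCV for frame extraction and a hypothetical describe_edit callable wrapping the language model; it is not the team's actual pipeline.

```python
# Illustrative frame-pairing step for editing data (assumes opencv-python;
# describe_edit is a hypothetical wrapper around a language model).
import cv2

def sample_frame_pairs(video_path, gap=30):
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    # Pair each frame with one a little later, e.g. a face without and with a smile.
    return [(frames[i], frames[i + gap]) for i in range(0, len(frames) - gap, gap)]

def build_editing_example(before, after, describe_edit):
    instruction = describe_edit(before, after)  # e.g. "make the person smile"
    return {"source": before, "target": after, "instruction": instruction}
```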


AI image editing: nine examples of style, color, extraction, addition, replacement, facial expressions, removal, movement, and background.
OmniGen 2 lets users make local edits without regenerating the entire image. | Image: Wu et al.

For contextual image generation, OmniGen 2 tracks people or objects across multiple video frames, helping the model learn how a single subject appears in different scenarios.
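
A simple way to picture this subject-tracking step is shown below; detect_subject and embed are hypothetical placeholders for a detector and an embedding model, and the matching is reduced to a cosine-similarity check.

```python
# Toy subject-tracking sketch: keep crops whose embeddings match a reference.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb + 1e-8)

def collect_subject_views(frames, detect_subject, embed, threshold=0.8):
    views, reference = [], None
    for frame in frames:
        crop = detect_subject(frame)        # hypothetical person/object detector
        if crop is None:
            continue
        emb = embed(crop)                   # hypothetical embedding model
        if reference is None:
            reference, views = emb, [crop]
        elif cosine(emb, reference) > threshold:
            views.append(crop)              # same subject in a different scene
    return views
```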

Collage: nine AI image edits with object design, scene compositing, character replacement, anime hybridization, and background replacement.
OmniGen 2 can merge multiple input images into a single result. | Image: Wu et al.

Novel position embedding for multimodal prompts

The team introduced a new “Omni-RoPE” position embedding that splits position information three ways: a sequence and modality ID to distinguish images, and 2D coordinates for each image element. This helps the model keep track of multiple inputs and combine them spatially.
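
Based on that description, here is a toy sketch of how such a three-part position could be assigned, with one sequence/modality ID per image and a 2D grid inside it. It only mirrors the idea in the text, not the actual Omni-RoPE implementation.

```python
# Toy version of the three-part position: (sequence/modality id, row, column).
def assign_positions(segments):
    """segments: list of ("text", n_tokens) or ("image", height, width)."""
    positions, seq_id = [], 0
    for segment in segments:
        if segment[0] == "text":
            for _ in range(segment[1]):
                positions.append((seq_id, 0, 0))  # text tokens carry no 2D coordinates
                seq_id += 1
        else:
            _, height, width = segment
            for row in range(height):
                for col in range(width):
                    positions.append((seq_id, row, col))  # 2D layout within one image
            seq_id += 1  # a single id per image keeps multiple inputs apart
    return positions

# Example: a short instruction followed by two 2x2 reference images.
print(assign_positions([("text", 3), ("image", 2, 2), ("image", 2, 2)]))
```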

Diagram of the Omni-RoPE method: A text instruction and two input images are combined into an output image, with each element being assigned a unique ID and coordinates.
Omni-RoPE assigns each element – text or image – a unique ID, letting the model accurately combine multiple inputs. | Image: Wu et al.

A unique aspect of OmniGen 2 is that it uses VAE (Variational Autoencoder) features exclusively as input for the diffusion decoder, instead of integrating them into the main language model. This design choice streamlines the architecture and helps preserve the model’s core language understanding.
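
The routing can be summarized in a few lines; the function below is a hedged sketch with placeholder callables, not the real interface.

```python
# Sketch of the routing choice: VAE features feed only the diffusion decoder,
# never the language model (mllm, vae_encode, diffusion_decoder are placeholders).
def edit_image(image, instruction, mllm, vae_encode, diffusion_decoder):
    hidden_states = mllm(instruction)   # the language model works from its own tokens
    vae_features = vae_encode(image)    # fine-grained pixel detail for the image path
    return diffusion_decoder(hidden_states, vae_features)
```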

Reflection mechanism for iterative improvement

A key feature is OmniGen 2’s reflection mechanism, which lets the model evaluate its own images and improve them in several rounds. The system spots flaws in the generated image and suggests specific fixes.
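
In pseudocode terms, such a loop might look like the sketch below, where generate, critique, and revise_prompt are hypothetical callables standing in for the model's own components.

```python
# Hypothetical reflection loop: generate, self-critique, revise, repeat.
def generate_with_reflection(prompt, generate, critique, revise_prompt, max_rounds=3):
    image = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, image)        # model inspects its own output
        if feedback.get("ok"):                    # no flaws found, stop early
            break
        prompt = revise_prompt(prompt, feedback)  # fold the suggested fixes back in
        image = generate(prompt)
    return image
```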

Collage of four chat conversations with incorrect image prompts and their corrections for accurate image generation.
The reflection mechanism allows OmniGen 2 to refine images automatically. | Image: Wu et al.

Since there were no strong benchmarks for contextual image generation, the researchers introduced the OmniContext benchmark. It includes three categories – Character, Object, and Scene – with eight subtasks and 50 examples each.


Evaluation is done by GPT-4.1, which scores prompt accuracy and subject consistency from 0 to 10. OmniGen 2 scored 7.18 overall, outperforming all other open-source models. GPT-4o, which recently added native image generation, scored 8.8.
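
For readers who want to reproduce this kind of LLM-as-judge setup in spirit, the snippet below shows a plausible GPT-4.1 call via the OpenAI Python client (assuming an API key is configured); the exact judging prompt used for OmniContext is not reproduced here.

```python
# Plausible LLM-as-judge call (assumes the OpenAI Python client and an API key;
# the actual OmniContext judging prompt may differ).
from openai import OpenAI

client = OpenAI()

def judge(prompt_text, image_url):
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Rate the image from 0 to 10 for (a) how well it follows "
                         f"the prompt and (b) subject consistency.\nPrompt: {prompt_text}"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```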

For text-to-image generation, OmniGen 2 posted competitive results on key benchmarks like GenEval and DPG-Bench. In image editing, it set a new state-of-the-art among open-source models.

There are still some gaps: English prompts work better than Chinese, body shape changes are tricky, and output quality depends on the input image. For ambiguous multi-image prompts, the system needs clear instructions for object placement.

The team plans to release the models, training data, and build pipelines on Hugging Face.
