Meta’s new AI model “Movie Gen” brings text to life with video, image, and audio generation
Meta has introduced Movie Gen, a new AI model that generates videos, images and audio from text input. It can also edit existing videos.
At the core of Movie Gen is a 30-billion-parameter transformer model for video and image generation. It produces videos up to 16 seconds long at 16 frames per second and supports several aspect ratios (1:1, 9:16, 16:9) at a base resolution of 768 × 768 pixels. An additional upscaler can increase the resolution to Full HD (1080p).
A separate 13-billion-parameter model handles audio generation. It can create ambient sound, background music, and sound effects to match videos up to 45 seconds long at a 48 kHz sampling rate.
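To put the stated limits in concrete terms, here is a minimal arithmetic sketch in Python. It uses only the figures quoted above (16 seconds at 16 fps for video, 45 seconds at 48 kHz for audio). The function and constant names are illustrative and not part of any Meta API, and the sample count assumes a single audio channel.

```python
# Figures quoted in the article; all names here are illustrative only.
VIDEO_SECONDS = 16
VIDEO_FPS = 16
AUDIO_SECONDS = 45
AUDIO_SAMPLE_RATE_HZ = 48_000


def frame_count(seconds: int, fps: int) -> int:
    """Number of frames in a clip of the given length."""
    return seconds * fps


def sample_count(seconds: int, sample_rate_hz: int) -> int:
    """Number of audio samples per channel for a clip of the given length."""
    return seconds * sample_rate_hz


max_frames = frame_count(VIDEO_SECONDS, VIDEO_FPS)
max_samples = sample_count(AUDIO_SECONDS, AUDIO_SAMPLE_RATE_HZ)

print(max_frames)   # 256
print(max_samples)  # 2160000
```

In other words, the longest clip the video model produces holds 256 frames, while the longest audio track holds about 2.16 million samples per channel.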
Movie Gen also includes video editing capabilities that can modify existing videos using text instructions. Another feature allows users to create personalized videos by combining a photo of a person with a text description.
Meta claims performance edge
Meta says Movie Gen outperforms competing systems, including Runway, OpenAI's Sora, LumaLabs, Kling and Pika, in human ratings. The gap appears smallest with Sora and Kling. Sora can reportedly produce consistent videos up to one minute long, and at a higher frame rate than Meta states for Movie Gen.
The company trained the models using licensed and publicly available datasets. The video generation model was pre-trained on about 100 million videos and one billion images. The audio model used approximately one million hours of audio data. More details can be found in the paper.
Movie Gen is currently intended for research purposes only and is not publicly available. Meta plans to work with filmmakers and creatives to incorporate feedback before a potential release.
The third generation of Meta’s AI media models
Meta describes Movie Gen as the third generation of its AI media models, combining all previous modalities and allowing for more precise control. The company believes that the models could enable various new products.
However, Meta admits that the current models still have limitations. In particular, inference time needs to come down, and quality could be improved by further scaling. Challenges remain with complex geometry, object manipulation, physics, and audio synchronization for dense or occluded motion.
Meta stresses that the technology is not meant to replace artists and animators, but to create new forms of expression. The company mentions animated “day in the life” videos for Instagram Reels or personalized birthday greetings for WhatsApp as possible applications.