Whether you’re an aspiring filmmaker or a creator who loves making videos for your audience, Meta believes everyone should have access to tools that enhance their creativity. Today, the tech giant is premiering its breakthrough generative AI research project, “Movie Gen,” which spans image, video, and audio modalities.
Meta’s latest research demonstrates how users can leverage simple text inputs to produce custom videos and sounds, edit existing videos, and even transform personal images into unique video content. According to the company, Movie Gen outperforms similar models in the industry across these tasks when evaluated by humans.
This work builds on Meta’s long history of sharing fundamental AI research with the community. The company’s first wave of generative AI started with the “Make-A-Scene” series, which enabled the creation of images, audio, video, and 3D animations. The advent of diffusion models then brought a second wave with the “Llama Image” foundation models for higher-quality image and video generation, as well as editing. Now, Movie Gen represents Meta’s third wave, combining these modalities and enabling even finer-grained control for creators.
While there are many exciting use cases, Meta is quick to note that generative AI is not a replacement for artists and animators. The company is sharing this research because it believes in the power of this technology to help people express themselves in new ways and provide opportunities to those who might not otherwise have them. Meta’s hope is that one day, everyone will be able to bring their artistic visions to life and create high-definition videos and audio using Movie Gen.
Under the Hood of Movie Gen
Meta’s new AI model, Movie Gen, has four key capabilities: video generation, personalized video generation, precise video editing, and audio generation. The company has trained these models on a combination of licensed and publicly available datasets.
- Video Generation: Given a text prompt, Movie Gen uses a joint model optimized for both text-to-image and text-to-video to create high-quality, high-definition images and videos. This 30 billion parameter transformer model can generate videos up to 16 seconds long at 16 frames per second, reasoning about object motion, subject-object interactions, and camera motion (a simplified sampling sketch follows this list).
- Personalized Videos: Meta has expanded the foundation model to support personalized video generation. By taking a person’s image and a text prompt, the model can generate a video containing the reference person and rich visual details.
- Precise Video Editing: The editing variant of Movie Gen takes a video and a text prompt as input, performing localized edits such as adding, removing, or replacing elements, as well as global changes like background or style modifications. Unlike traditional editing tools or other generative models, it preserves the original content, targeting only the relevant pixels (see the compositing sketch after this list).
- Audio Generation: Meta has trained a 13 billion parameter audio generation model that can take a video and optional text prompts and generate high-quality, high-fidelity audio up to 45 seconds, including ambient sound, sound effects, and background music—all synced to the video content.
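To make the video-generation step a little more concrete, here is a minimal, deliberately toy sketch of how a text-conditioned latent video model might be sampled with simple Euler integration. Every name in it (`text_encoder`, `denoiser`, `LATENT_SHAPE`) is a hypothetical stand-in; Meta has not released Movie Gen’s code or API, and the real 30 billion parameter transformer is far more involved than these toy dynamics.

```python
import numpy as np

FPS, SECONDS = 16, 16                        # 16 fps for up to 16 s, per the post
LATENT_SHAPE = (FPS * SECONDS, 8, 8, 4)      # (frames, height, width, channels); assumed

def text_encoder(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for a real text encoder; returns a fixed-size embedding."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal(512)

def denoiser(x: np.ndarray, t: float, cond: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the 30B transformer's velocity/noise prediction.
    A real model would use `cond` to steer generation; this toy ignores it."""
    return -x * (1.0 - t)

def generate_video_latents(prompt: str, steps: int = 50) -> np.ndarray:
    """Integrate from pure noise toward a clean latent video with Euler steps."""
    cond = text_encoder(prompt)
    x = np.random.default_rng(0).standard_normal(LATENT_SHAPE)  # start from noise
    for i in range(steps):
        t = i / steps
        x = x + (1.0 / steps) * denoiser(x, t, cond)
    return x                                  # would then be decoded to RGB frames

latents = generate_video_latents("a robot surfing at sunset")
print(latents.shape)                          # (256, 8, 8, 4)
```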
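Similarly, the pixel-preserving behavior described for the editing model can be illustrated with a simple mask-based composite: the model’s output is blended back into the source so that only the edited region changes. The arrays below are hypothetical placeholders for illustration, not Movie Gen’s actual editing mechanism.

```python
import numpy as np

frames, h, w, c = 32, 64, 64, 3
original  = np.zeros((frames, h, w, c))       # source video (placeholder values)
generated = np.ones((frames, h, w, c))        # model's edited content (placeholder)
mask      = np.zeros((frames, h, w, 1))       # 1.0 where the edit should apply
mask[:, 16:48, 16:48, :] = 1.0                # e.g. the region being replaced

# Only masked pixels take the generated content; everything else is untouched.
edited = mask * generated + (1.0 - mask) * original
assert np.array_equal(edited[:, 0, 0], original[:, 0, 0])  # unmasked pixels preserved
```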
Looking Ahead
Meta’s human evaluations show Movie Gen outperforming industry competitors across these capabilities. While the current models have limitations, the company says it will continue improving them and working closely with filmmakers and creators to integrate their feedback.
By taking a collaborative approach, Meta wants to ensure it’s creating tools that help people enhance their inherent creativity in new ways. Imagine animating a “day in the life” video to share on Reels, or creating a customized animated birthday greeting. With creativity and self-expression at the forefront, the possibilities are endless.