VideoPoet stands out from other video generation models by integrating multiple video generation capabilities into a single large language model.
Summary:
- VideoPoet is a new multimodal model capable of processing text, videos, images, and audio to create, edit, and stylize videos.
- Unlike diffusion-based video models, VideoPoet consolidates multiple video generation capabilities within a single Large Language Model (LLM).
Google researchers have introduced VideoPoet, a large language model that processes multimodal inputs such as text, images, videos, and audio to produce videos.
VideoPoet uses a zero-shot, decoder-only transformer architecture, allowing it to generate content it hasn’t specifically been trained on. The model undergoes a two-step training process, similar to other Large Language Models (LLMs): pretraining and task-specific adaptation. The pretrained LLM serves as a base that can be fine-tuned for various video generation tasks, the researchers explain.
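To make that two-stage recipe concrete, here is a minimal PyTorch sketch of a decoder-only transformer pretrained with next-token prediction and then adapted to a single task. The model size, data, and training loops are toy stand-ins chosen for illustration, not VideoPoet's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderOnlyLM(nn.Module):
    """Tiny decoder-only transformer over a shared multimodal token vocabulary."""
    def __init__(self, vocab_size=1024, d_model=256, n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                       # tokens: (batch, seq) of token ids
        seq = tokens.size(1)
        x = self.tok(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))  # causal (autoregressive) attention

def next_token_loss(model, tokens):
    """Standard next-token prediction: predict token t+1 from tokens up to t."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

model = DecoderOnlyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Stand-in data: in reality these would be token sequences produced by the
# text/image/video/audio tokenizers over large corpora.
pretraining_batches = [torch.randint(0, 1024, (2, 128)) for _ in range(4)]
text_to_video_batches = [torch.randint(0, 1024, (2, 128)) for _ in range(2)]

# Stage 1: pretraining on the broad multimodal mixture.
for tokens in pretraining_batches:
    opt.zero_grad()
    next_token_loss(model, tokens).backward()
    opt.step()

# Stage 2: task-specific adaptation, e.g. fine-tuning on text-to-video data,
# starting from the pretrained weights.
for tokens in text_to_video_batches:
    opt.zero_grad()
    next_token_loss(model, tokens).backward()
    opt.step()
```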
VideoPoet sets itself apart from competing video models, most of which are diffusion models that add noise to training data and then learn to reconstruct it. In contrast, VideoPoet integrates various video generation capabilities into a single Large Language Model (LLM) without separately trained components dedicated to specific tasks.
VideoPoet excels in diverse tasks such as text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio generation.
VideoPoet operates as an autoregressive model, generating output conditioned on what it has generated previously. It is trained on video, text, image, and audio inputs, using tokenizers to convert each modality to and from discrete tokens.
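The sketch below illustrates that autoregressive loop under simplified assumptions: a toy tokenizer maps pixels to and from discrete token ids, and a stand-in language model predicts one token at a time conditioned on everything generated so far. The class names, shapes, and random outputs are hypothetical placeholders, not VideoPoet's components (the paper describes learned tokenizers, MAGVIT-v2 for images and video and SoundStream for audio).

```python
import torch

class ToyVideoTokenizer:
    """Stand-in for a learned video tokenizer: pixels <-> discrete token ids."""
    vocab_size = 1024
    def encode(self, frames):             # frames: (T, H, W, 3) uint8
        # A real tokenizer would quantize the frames; here we return random ids.
        return torch.randint(0, self.vocab_size, (frames.shape[0] * 16,))
    def decode(self, tokens):              # tokens: (N,) ids -> (T, H, W, 3) frames
        t = tokens.numel() // 16
        return torch.zeros(t, 64, 64, 3, dtype=torch.uint8)

class ToyLM:
    """Stand-in decoder-only LM: returns logits over the shared vocabulary."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
    def logits(self, context):              # context: (seq,) token ids so far
        return torch.randn(self.vocab_size)

def generate(model, prompt_tokens, n_new, temperature=1.0):
    """Sample n_new tokens, each conditioned on all previously generated ones."""
    out = prompt_tokens.clone()
    for _ in range(n_new):
        probs = torch.softmax(model.logits(out) / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1)    # sample the next token
        out = torch.cat([out, nxt])
    return out

video_tok = ToyVideoTokenizer()
lm = ToyLM(video_tok.vocab_size)

# Image-to-video style conditioning: encode one conditioning frame into tokens,
# autoregressively generate tokens for the following frames, then decode.
cond_frame = torch.zeros(1, 64, 64, 3, dtype=torch.uint8)
prompt_tokens = video_tok.encode(cond_frame)
all_tokens = generate(lm, prompt_tokens, n_new=16 * 8)          # 8 more "frames"
video = video_tok.decode(all_tokens[prompt_tokens.numel():])    # tokens -> pixels
print(video.shape)                                              # (8, 64, 64, 3)
```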
Based on their findings, the researchers are optimistic about the potential of Large Language Models (LLMs) for video generation. They envision future enhancements that would enable the framework to support ‘any-to-any’ generation, such as text-to-audio, audio-to-video, and video captioning, among numerous other possibilities.
Text to video
A vaporwave fashion dog in Miami looks around and barks, digital art.
Credits: Google
Image to video with text prompts
• Teddy bears holding hands, walking down rainy 5th ave
• A skeleton drinking a glass of soda.
• Two pandas playing cards.
Video to audio
The researchers first generated 2-second video clips, and VideoPoet then predicted the audio on its own, without relying on any text prompts.
Additionally, VideoPoet demonstrated its capability to craft a short film by compiling multiple short clips. To achieve this, the researchers asked Bard, Google’s counterpart to ChatGPT, to write a short screenplay broken into specific prompts. They then generated video for each prompt and stitched the clips together into the final short film.
Longer videos, editing and camera motion
Google stated that VideoPoet addresses the challenge of generating longer videos by conditioning on the last second of a video to predict the next second. The researchers noted, “By chaining this repeatedly, we show that the model can not only extend the video well but also faithfully preserve the appearance of all objects even over several iterations.”
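As a rough illustration of that chaining strategy, the sketch below extends a clip one second at a time, conditioning each step only on the final second generated so far. The frame rate, shapes, and predictor are hypothetical placeholders rather than the actual model.

```python
import torch

FPS = 8                                     # assumed frames per second for the toy example

def toy_predict_next_second(last_second):
    """Stand-in for the model: maps the last second of frames to the next second."""
    return torch.zeros_like(last_second)    # (FPS, H, W, 3) of placeholder frames

def extend_video(video, extra_seconds):
    """Repeatedly condition on the final second to append the next one."""
    for _ in range(extra_seconds):
        last_second = video[-FPS:]                        # condition on the last second
        next_second = toy_predict_next_second(last_second)
        video = torch.cat([video, next_second], dim=0)    # append and repeat
    return video

clip = torch.zeros(2 * FPS, 64, 64, 3)      # an initial 2-second clip
longer = extend_video(clip, extra_seconds=5)
print(longer.shape[0] / FPS, "seconds")     # 7.0 seconds
```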
VideoPoet exhibits the ability to manipulate the movement of objects in existing videos. For instance, a video featuring the Mona Lisa can be prompted to depict the iconic painting yawning.
Text prompts can also be used to control camera motion when animating existing images.
For instance, the initial prompt generated an image based on the description: “Adventure game concept art of a sunrise over a snowy mountain by a crystal clear river.” Subsequently, additional prompts were introduced, progressing from left to right: “Zoom out, Dolly zoom, Pan left, Arc shot, Crane shot, and FPV drone shot.”