Microsoft is pushing the next wave of AI-driven video production: the company has introduced a new video-generation model called DragNUWA.
The model aims to give users fine-grained control over video generation by treating text, images, and drag paths as three complementary control signals, so that the semantics, spatial layout, and timing of the output can all be directed precisely.
AI companies are racing to advance video generation, and many have already released models that produce video clips from text or image inputs.
DragNUWA lets users interact directly with the background of an image or the objects within it, translating those drag actions into camera movements or object motions in a coherent generated video.
In addition to conventional text and image prompts, the model introduces path-based generation as a new approach: users manipulate objects or entire frames along specified paths. This offers a straightforward way to produce videos that are tightly controlled in meaning, position, and timing while keeping output quality high.
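To make the three control signals concrete, here is a minimal sketch of how they might be packaged as inputs to such a model. The `DragControl` structure and its field names are illustrative assumptions, not DragNUWA's actual interface; the open-source release defines its own.

```python
# Illustrative only: this just models the three kinds of control the
# article describes (text, image, drag paths), not DragNUWA's real API.
from dataclasses import dataclass
from typing import List, Tuple

Path = List[Tuple[int, int]]  # an ordered sequence of (x, y) pixel positions

@dataclass
class DragControl:
    text: str               # semantic control: what should happen in the clip
    image_path: str         # spatial control: the starting frame
    drag_paths: List[Path]  # motion control: one drag path per moved region

controls = DragControl(
    text="a sailboat drifting across a calm lake at sunset",
    image_path="lake.png",
    drag_paths=[[(120, 200), (160, 195), (210, 190)]],  # drag the boat rightward
)
# A hypothetical generation call would then consume `controls` to render the video.
```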
Microsoft has released the model as an open-source project and published a demo so the community can try it.
AI video generation typically works from text, image, or path-based inputs, and each of these methods on its own struggles to give precise control over the desired result.
Text and images together still cannot convey the intricate motion details in a video, an image alone cannot depict what future frames should contain, and text can be ambiguous when expressing abstract concepts.
In August 2023, Microsoft's AI team proposed DragNUWA to address this: an open-domain video-generation model that combines all three control signals.
This lets users precisely specify the text, image, and path inputs needed to control different aspects of the resulting video, such as camera movements (including zoom effects) and object motion.
The path supplies detailed motion information, the text describes future events, and the image helps distinguish objects from one another.
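As an illustration of what "detailed motion information" from a path can mean, the sketch below interpolates a sparse drag path into one displacement vector per frame, the kind of dense motion signal a video model could condition on. This is a generic construction under assumed conventions, not the paper's exact pipeline:

```python
# Turn a few clicked (x, y) points into per-frame motion vectors.
import numpy as np

def path_to_frame_displacements(points, num_frames):
    """Interpolate sparse (x, y) drag points into one displacement per frame."""
    pts = np.asarray(points, dtype=float)       # shape: (k, 2)
    t_src = np.linspace(0.0, 1.0, len(pts))     # parameterize the clicks evenly
    t_dst = np.linspace(0.0, 1.0, num_frames)   # one sample per output frame
    x = np.interp(t_dst, t_src, pts[:, 0])
    y = np.interp(t_dst, t_src, pts[:, 1])
    traj = np.stack([x, y], axis=1)             # (num_frames, 2) positions
    return np.diff(traj, axis=0)                # (num_frames - 1, 2) motion per step

# Dragging an object ~90 px right and 10 px up over 16 frames:
disp = path_to_frame_displacements([(120, 200), (160, 195), (210, 190)], num_frames=16)
print(disp.shape)  # (15, 2): one (dx, dy) vector per frame transition
```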
In its tests, Microsoft reported that the model can execute camera and object movements with precise, distinct directions when driven by multiple drag paths at once.