OpenAI Pushes Envelope Again With Sora Video Model
- By Paul Mah
- February 21, 2024
OpenAI last week unveiled Sora, a text-to-video AI model with the potential to upend the advertising and video industries with its ability to generate photorealistic, high-resolution videos up to a minute long.
Earth-shattering capabilities
Building on past research from the DALL-E and GPT models, Sora is a diffusion model that uses a transformer architecture. Videos and images are represented as collections of smaller units of data called patches, each of which is akin to a token in GPT.
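The patch idea can be sketched in code. The snippet below is purely illustrative: OpenAI has not published Sora's actual patch dimensions or pipeline, so the patch size and tensor layout here are assumptions chosen to show how a video becomes a sequence of "tokens" a transformer can consume.

```python
import numpy as np

def patchify(video, patch_size=(4, 16, 16)):
    """Split a video tensor (frames, height, width, channels) into
    spacetime patches, the video analogue of GPT's text tokens.
    Assumes each dimension divides evenly by the patch size."""
    t, h, w, c = video.shape
    pt, ph, pw = patch_size
    patches = (
        video.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
             .transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch grid first
             .reshape(-1, pt * ph * pw * c)    # one flat row per patch
    )
    return patches  # a sequence of patch "tokens" for a transformer

# A 16-frame, 64x64 RGB clip becomes 64 patch tokens of 3,072 values each.
video = np.zeros((16, 64, 64, 3))
tokens = patchify(video)
print(tokens.shape)  # (64, 3072)
```

Because the video is flattened into a uniform sequence of patches, the same transformer machinery that predicts text tokens can, in principle, operate on visual data of varying resolutions and durations.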
Under the hood, Sora offers a range of jaw-dropping capabilities. Beyond the ability to generate a video solely from text instructions, it can take an existing still image and generate a video from it, or animate the contents of a still image.
Moreover, the model can take an existing video and extend it, fill in missing frames, or create multiple shots within a single generated video and accurately reflect the same characters and visual style.
“Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world,” OpenAI researchers wrote.
Room for improvement
There is room for improvement, however. According to OpenAI, Sora may struggle to simulate the physics of a complex scene and may not understand specific instances of cause and effect. For instance, a video of a man taking a bite out of a cookie might later show the cookie without a bite mark.
OpenAI says it is building tools to help detect misleading content, including a detection classifier that can tell when a video was generated by Sora.
OpenAI also plans to include Coalition for Content Provenance and Authenticity (C2PA) metadata should Sora be deployed in a product. C2PA is an open technical standard for certifying the source and provenance of media content.
C2PA metadata increases the size of the resulting media file slightly and is already implemented in OpenAI’s DALL-E 3 text-to-image model. Users can visit sites like Content Credentials Verify to check the origins of an image or video.
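The core idea behind such provenance metadata can be illustrated with a toy example. Real C2PA embeds signed manifests, backed by X.509 certificates, into the media file itself; the snippet below is a heavily simplified stand-in that only shows the principle of binding a signed claim about an asset's origin to a hash of its content. The key, claim fields, and HMAC scheme are all illustrative assumptions, not the C2PA format.

```python
import hashlib
import hmac
import json

# Illustrative only: a real provenance system uses certificate-backed
# signing keys, not a shared secret baked into the code.
SIGNING_KEY = b"demo-key"

def make_claim(media_bytes, generator):
    """Create a provenance claim tied to the media's content hash,
    plus a signature over the claim."""
    claim = {
        "generator": generator,
        "content_sha256": hashlib.sha256(media_bytes).hexdigest(),
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return claim, signature

def verify_claim(media_bytes, claim, signature):
    """Recompute the signature and content hash; any edit to the media
    or the claim breaks verification."""
    payload = json.dumps(claim, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(signature, expected)
            and claim["content_sha256"] == hashlib.sha256(media_bytes).hexdigest())

media = b"...video bytes..."
claim, sig = make_claim(media, "Sora")
print(verify_claim(media, claim, sig))         # True
print(verify_claim(media + b"x", claim, sig))  # False: media was altered
```

This also shows why the metadata adds a little to the file size, as the article notes: the claim and signature travel with the asset.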
In closed testing
Sora can also generate still images, and OpenAI believes it exhibits capabilities suited to simulating digital worlds, too.
“[The capabilities] suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them,” wrote the OpenAI team in its technical report.
While OpenAI posted dozens of high-resolution videos it says were generated by Sora, the text-to-video AI model is currently available only to researchers assessing “critical areas for harms or risks” and a select group from the video industry.
“We are also granting access to a number of visual artists, designers, and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals,” the company added.
You can read more about Sora on OpenAI’s website.
Image credit: OpenAI (Still shot of video)
Paul Mah
Paul Mah is the editor of DSAITrends, where he reports on the latest developments in data science and AI. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose.