OpenAI’s Sora, which can generate videos and interactive 3D environments on demand, is a remarkable demonstration of the cutting edge of GenAI, and a significant milestone for the field.
Interestingly, the groundwork for Sora was laid by an AI model architecture known as the diffusion transformer, which made its debut in AI research in 2022.
The diffusion transformer, which also powers Stability AI’s latest image generator, Stable Diffusion 3.0, is poised to transform the GenAI field by letting models scale to far larger sizes than was previously practical.
In June 2022, Saining Xie, a computer science professor at NYU, initiated the research project that led to the creation of the diffusion transformer. Teaming up with William Peebles, who worked as an intern at Meta’s AI research lab and now co-leads Sora at OpenAI, Xie combined the concepts of diffusion and transformer in this innovative model.
Typical AI-powered media generators, like OpenAI’s DALL-E 3, rely on a process called diffusion to produce media such as images, videos, and music. During training, noise is gradually added to a piece of media until it becomes unrecognizable; the model learns to reverse this corruption step by step, so that at generation time it can start from pure noise and gradually denoise its way to the desired output.
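The forward half of that process can be sketched in a few lines. This is a minimal, illustrative example using a simple linear noise schedule on a toy “image”; the schedule values and sizes are assumptions for demonstration, not the settings of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # toy 8x8 grayscale "image"

def add_noise(x, t, num_steps=1000):
    """Return the noised sample for timestep t (0 <= t < num_steps)."""
    betas = np.linspace(1e-4, 0.02, num_steps)  # linear noise schedule (assumed)
    alpha_bar = np.cumprod(1.0 - betas)[t]      # fraction of signal remaining at step t
    noise = rng.standard_normal(x.shape)
    # Mix the original signal with Gaussian noise in one shot.
    return np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * noise

slightly_noisy = add_noise(image, t=10)   # still mostly recognizable
very_noisy = add_noise(image, t=999)      # close to pure Gaussian noise
```

A denoising model is then trained to predict and remove the noise at each step, which is what lets it run the process in reverse at generation time.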
Transformers, as seen in models like GPT-4 and ChatGPT, excel at complex reasoning tasks thanks to their attention mechanism, which, for every element of the input, weighs the relevance of every other element when producing the output.
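The core of that attention mechanism fits in a few lines. Below is a minimal sketch of scaled dot-product self-attention in NumPy; the token count and embedding size are arbitrary assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays of query, key, and value vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # pairwise relevance of every element to every other
    weights = softmax(scores)      # each row is a probability distribution over inputs
    return weights @ V             # output = relevance-weighted mix of the values

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # 4 tokens, 8-dimensional embeddings
out = attention(x, x, x)           # self-attention: tokens attend to each other
```

Each output row is a weighted average of the value vectors, with the weights expressing how relevant each input element is to the one being produced.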
Transformers also bring a practical advantage: they scale smoothly, so adding more training data and more parameters tends to yield predictable gains. That property is what allows a model like Sora to be trained on vast amounts of data and still improve as it grows.
Diffusion models have traditionally used a U-Net as their denoising backbone; with the results demonstrated by projects like Sora, transformers are increasingly replacing U-Nets in diffusion models for better efficiency and, crucially, scalability.
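The basic idea behind that swap can be sketched as follows: rather than passing the noisy image through a U-Net, a diffusion transformer splits it into patches, treats each patch as a token, and processes the token sequence with a standard transformer. The patch size and image shape below are illustrative assumptions.

```python
import numpy as np

def patchify(image, patch=4):
    """Split an (H, W, C) image into a (num_patches, patch*patch*C) token array."""
    H, W, C = image.shape
    return (image.reshape(H // patch, patch, W // patch, patch, C)
                 .transpose(0, 2, 1, 3, 4)       # group pixels by patch
                 .reshape(-1, patch * patch * C))  # flatten each patch into a token

noisy = np.random.default_rng(0).random((16, 16, 3))  # a noisy 16x16 RGB "image"
tokens = patchify(noisy)  # 4x4 grid of patches -> 16 tokens of 48 values each
# These tokens would then be embedded and fed through transformer blocks,
# which predict the noise to remove at the current diffusion step.
```

Because the backbone is now an ordinary token-sequence transformer, the same scaling recipe that works for language models applies to the diffusion model as well.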
These innovations have paved the way for more efficient and scalable models like Sora and Stable Diffusion 3.0, and they hint at a promising future for AI-generated content.