Why world models could be the future of AI

By Nat Rubio-Licht

Nov 13, 2025, 12:33pm UTC

Today’s most popular AI models are great with words.

But when given tasks beyond letters and numbers, these models often fail to grasp the world around them. Conventional AI models tend to flounder when faced with real-world tasks, struggling to understand things like physics and causality. It’s why self-driving cars still struggle with edge cases, resulting in safety hazards and traffic law violations. It’s why industrial robots still need extensive training before they can be trusted not to break the things – or people – around them.

The problem is that these models can’t reconcile what they see with what’s actually real.

And from Abu Dhabi to Silicon Valley, a group of researchers at the Institute of Foundation Models (IFM) at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) is working to fix that. These researchers have their sights set on world models: systems that learn how the world behaves so they can make decisions and act within it.

“Our world model is designed to let AI understand and imagine how the world works — not just by seeing what’s happening, but by predicting what could happen next,” Hector Liu, director of IFM’s Silicon Valley Lab, told The Deep View.

As it stands, tech firms are intent on using language to control AI – whether via chatbots, video and image generation, or agents. But conventional large language models lack what Stanford University researcher Dr. Fei-Fei Li calls “spatial intelligence,” or the ability to visualize the way humans do. These models are only good at predicting what to say or create based on their training data, and are unable to ground what they generate in reality.

This is the main divide between a world model and a video generation model, Liu said: the video model renders appearance, while the world model simulates reality.

Video generation tools like OpenAI’s Sora, Google’s Veo and xAI’s Grok Imagine can produce visually realistic scenes, but world models are designed to understand and simulate the world at large.

While a video generator creates a scene with no sense of state, a world model maintains an internal understanding of the world around it, and how that world evolves, said Liu.

“It predicts how scenes unfold over time and how they respond to actions or interventions, rather than just what they look like,” Liu said. Rather than just generating a scene, these models are interactive and reactive. If a tree falls in the world model, its virtual stump cracks, and the digital grass is flattened in its wake.
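
To make that distinction concrete, here is a minimal, hypothetical Python sketch of the difference Liu describes: a video generator maps a prompt straight to frames, while a world model keeps an internal state that actions can change. The classes and transition rules below are illustrative toys, not PAN’s actual components.

```python
from dataclasses import dataclass


@dataclass
class WorldState:
    """Toy internal state: one tree and the grass around it."""
    tree_standing: bool = True
    grass_flattened: bool = False


class ToyWorldModel:
    """Stateful simulator: predicts the next state given an action."""

    def __init__(self) -> None:
        self.state = WorldState()

    def step(self, action: str) -> WorldState:
        # Hand-written transition rules stand in for learned dynamics.
        if action == "fell_tree" and self.state.tree_standing:
            self.state.tree_standing = False
            self.state.grass_flattened = True  # the world reacts to the event
        return self.state


model = ToyWorldModel()
print(model.step("fell_tree"))
# WorldState(tree_standing=False, grass_flattened=True)
```

A stateless video generator, by contrast, would have no `state` to consult or update; each output depends only on the prompt, so a felled tree in one frame implies nothing about the next.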

Several companies are in the running to create models that understand the world around them; Google DeepMind and Nvidia, for example, both released new versions of their world models in August.

But MBZUAI’s PAN world model has several advantages over its competitors, said Liu.

  • Rather than working only in narrow domains, MBZUAI’s PAN is trained for generality, said Liu, designed to transfer its knowledge across domains. It does so by combining language, vision and action data into one unified space, enabling broad simulation.
  • The structure of PAN separates “reasoning from perception,” meaning seeing is distinct from thinking, said Liu. That separation provides the technical advantage of observability, preventing PAN from drifting away from real-world physics. (A rough sketch of both ideas follows this list.)
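
As a rough illustration of those two ideas – one shared latent space across modalities, and reasoning kept separate from perception – here is a hypothetical Python sketch. The projection matrices, dimensions, and dynamics function are invented stand-ins, not PAN’s real architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # shared latent dimension (assumed for illustration)

# "Perception": per-modality encoders projecting into one unified space.
W_text = rng.normal(size=(100, DIM))    # toy text-feature projection
W_vision = rng.normal(size=(256, DIM))  # toy image-feature projection
W_action = rng.normal(size=(8, DIM))    # toy action projection


def encode(text_feat, image_feat, action_feat):
    """Fuse language, vision and action features into one latent state."""
    return text_feat @ W_text + image_feat @ W_vision + action_feat @ W_action


# "Reasoning": a dynamics function over latents, kept separate from
# perception, so the predicted state can be inspected before rendering.
W_dyn = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)


def predict_next(latent):
    return np.tanh(latent @ W_dyn)


z = encode(rng.normal(size=100), rng.normal(size=256), rng.normal(size=8))
z_next = predict_next(z)  # simulate one step forward in latent space
print(z_next.shape)       # (64,)
```

Keeping the dynamics step separate from the encoders is what makes the intermediate state observable: one can check the predicted latent against physical constraints before any pixels are generated.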

To gauge how well PAN understands the world, MBZUAI researchers track two main factors: long-horizon performance, or the ability to simulate a coherent world over time, and agentic usability. If something is wrong within a world model, the agent working within it goes haywire.
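
A toy version of such a long-horizon check might look like the following Python sketch: roll a slightly wrong learned model forward alongside a reference simulator and watch how far the prediction drifts as the horizon grows. The dynamics here are invented for illustration, not MBZUAI’s benchmark.

```python
import numpy as np


def rotation(theta: float) -> np.ndarray:
    """2-D rotation matrix, standing in for world dynamics."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])


REFERENCE = rotation(0.100)  # "ground-truth" dynamics
LEARNED = rotation(0.102)    # learned model with a small systematic error

for horizon in (10, 100, 1000):
    x_true = x_pred = np.array([1.0, 0.0])
    for _ in range(horizon):
        x_true = REFERENCE @ x_true
        x_pred = LEARNED @ x_pred
    # Per-step errors compound, so drift grows with the horizon.
    print(f"horizon={horizon:5d}  drift={np.linalg.norm(x_true - x_pred):.4f}")
```

The point of the exercise is the trend, not the numbers: a model whose drift stays small over long rollouts keeps its simulated world coherent, which is exactly what an agent operating inside it needs.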

The next step in PAN’s development is to make the model’s “imagination space,” or inner visualization capabilities, richer and more precise, allowing it to understand and render worlds in even finer detail. MBZUAI is also expanding beyond vision, researching modalities such as sound and motion signals, as well as using an agent to test and learn from different scenarios.

“That’s how we move from a model that only imagines the world to one that can actually think and act within it,” said Liu.

Though several developers want to build models that see the world for what it is, these systems are still in very early stages. 

Progress has been made on visual understanding, but humans have more than one sense. For a world model to be truly complete, it must also develop a strong understanding of audio, touch and physical interaction. The ideal world model not only understands all those modalities but can also create simulations in any of them. “If a modality is missing, the simulation will always be incomplete,” said Liu.

Creating an AI that understands all of those modalities amounts to building a model that senses and understands the world much as a human does. But doing so comes with significant technical barriers, including the need for vast amounts of complex training data and potentially for entirely new model architectures.

But surpassing those barriers could have far-reaching implications, said Liu.

In robotics, these models could reduce the need for intensive monitoring and training, limiting “real-world trial and error,” Liu said. Instead, the models that operate robots could be trained in simulation, perfecting actions and catching mistakes before the robots ever reach factory floors or homes. In self-driving cars, meanwhile, a world model could let an autonomous vehicle rehearse thousands of traffic scenarios before the rubber hits the road.

And the possibilities extend beyond the self-piloted machines available today, with research underway in domains such as sports strategy, where world models could simulate player outcomes, and animation and digital art, where they could help design and create worlds, said Liu. More discoveries could emerge once these models are actually in people’s hands.

“In the end, it’s about creating AI that doesn’t just react to the world but can think ahead.”