Why world models could be the future of AI

Welcome back. The Mohamed bin Zayed University of Artificial Intelligence just launched PAN, a state-of-the-art world model. In today’s newsletter, we dive into all things world models. A quick thank you to the MBZUAI team for giving us the scoop on this, and for sponsoring today’s edition.

IN TODAY’S NEWSLETTER

1. Why world models could be the future of AI

2. Inside MBZUAI’s next-generation world model

3. The daunting task of simulating reality

FRONTIER AI

Why world models could be the future of AI

Today’s most popular AI models are great with words. 

But when given tasks beyond letters and numbers, these models often fail to grasp the world around them, floundering at real-world problems and struggling to understand things like physics and causality. It’s why self-driving cars still stumble on edge cases, resulting in safety hazards and traffic law violations. It’s why industrial robots still need tons of training before they can be trusted not to break the things – or people – around them.

The problem is that these models can’t reconcile what they see with what’s actually real.

And from Abu Dhabi to Silicon Valley, a group of researchers from the Institute of Foundation Models at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) is working to fix that. These researchers have their sights set on world models: systems that internally simulate how the world works so they can predict, decide and act on the world around them.

“Our world model is designed to let AI understand and imagine how the world works — not just by seeing what’s happening, but by predicting what could happen next,” Hector Liu, Director at the Institute of Foundation Models (IFM) Silicon Valley Lab, told The Deep View.

As it stands, tech firms are intent on using language to control AI – whether via chatbots, video and image generation, or agents. But conventional large language models lack what Stanford University researcher Dr. Fei-Fei Li calls “spatial intelligence,” or the ability to visualize the way humans do. These models are only good at predicting what to say or create based on their training data, and are unable to ground what they generate in reality.

This is the main divide between a world model and a video generation model, Liu said: the video model renders appearance, while the world model simulates reality.

Video generation tools like OpenAI’s Sora, Google’s Veo and xAI’s Grok Imagine can produce visually realistic scenes, but world models are designed to understand and simulate the world at large. 

While a video generator creates a scene with no sense of state, a world model maintains an internal understanding of the world around it, and of how that world evolves, said Liu.

“It predicts how scenes unfold over time and how they respond to actions or interventions, rather than just what they look like,” Liu said. More than merely generating a scene, these models are interactive and reactive. If a tree falls in the world model, its virtual stump cracks, and the digital grass is flattened in its wake.
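For the technically curious, here’s a toy sketch of that distinction in Python – our own illustration, not PAN’s actual code, and every name in it is hypothetical. The point is the state: a world model’s actions update a persistent internal state, so consequences stick around, while a frame-by-frame generator has no state to update.

    from dataclasses import dataclass, field

    @dataclass
    class WorldState:
        # Persistent scene state: which objects exist and what condition they're in.
        objects: dict = field(default_factory=dict)

    class WorldModel:
        """Stateful: actions update an internal state, and consequences persist."""
        def __init__(self, initial: WorldState):
            self.state = initial

        def step(self, action: str) -> WorldState:
            # Transition function: predict how the world evolves in
            # response to an action or intervention.
            if action == "tree_falls":
                self.state.objects["tree"] = {"standing": False, "stump": "cracked"}
                self.state.objects["grass"] = {"flattened": True}  # lasting consequence
            return self.state

    # A pure video generator, by contrast, predicts the next frame from the
    # previous frames alone – nothing guarantees the tree stays fallen.
    model = WorldModel(WorldState({"tree": {"standing": True}, "grass": {"flattened": False}}))
    print(model.step("tree_falls").objects)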

TOGETHER WITH IBM

IT complexity is costly and can hinder growth.

Intelligent IT automation is key to taming chaos so you can get the most out of your IT spend. But—where do you focus?

  • Modernize your existing applications and data to align with business goals.

  • Connect your infrastructure to address compliance and security.

  • Integrate your data and middleware to optimize processes and data flows.

  • Infuse intelligence into processes and automate workflows with gen AI and agentic AI.

Optimize your journey to intelligent IT automation while avoiding the pitfalls. Get the insights you need from the IBM Institute for Business Value.

RESEARCH

Inside MBZUAI’s next-generation world model

There are several companies currently in the running to create models that understand the world around them. Both Google DeepMind and Nvidia released new versions of their world models in August, for example. 

But MBZUAI’s PAN world model has several advantages over its competitors, said Liu.

  • Rather than working only in a narrow domain, MBZUAI’s PAN is trained for generality, said Liu, designed to transfer its knowledge across domains. It does so by combining language, vision and action data into one unified space, enabling broad simulation.

  • The structure of PAN separates “reasoning from perception,” meaning seeing is distinct from thinking, said Liu. That separation makes the model’s predictions observable – a technical advantage that keeps PAN from drifting away from real-world physics. (A rough sketch of this split follows below.)

Note: This video, generated by PAN, has been compressed from its original 4K 24 FPS to fit this newsletter.
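Here’s roughly what that separation could look like in code – a hypothetical PyTorch sketch of our own, not PAN’s architecture; the module names, layer choices and dimensions are all assumptions. Perception maps raw observations into a latent space; reasoning predicts how that latent evolves under an action, where it can be inspected before anything is rendered.

    import torch
    import torch.nn as nn

    class Perception(nn.Module):
        """Encodes raw observations (pixels, tokens) into a shared latent space."""
        def __init__(self, obs_dim: int = 1024, latent_dim: int = 256):
            super().__init__()
            self.encoder = nn.Linear(obs_dim, latent_dim)

        def forward(self, obs: torch.Tensor) -> torch.Tensor:
            return self.encoder(obs)

    class Reasoning(nn.Module):
        """Predicts the next latent state from the current latent plus an action.
        Keeping this separate from perception leaves predictions inspectable
        before they are ever rendered to pixels."""
        def __init__(self, latent_dim: int = 256, action_dim: int = 16):
            super().__init__()
            self.dynamics = nn.Linear(latent_dim + action_dim, latent_dim)

        def forward(self, latent: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
            return self.dynamics(torch.cat([latent, action], dim=-1))

    # Seeing is distinct from thinking: encode once, then reason in latent space.
    z = Perception()(torch.randn(1, 1024))
    z_next = Reasoning()(z, torch.randn(1, 16))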

To gauge how well PAN understands the world, MBZUAI researchers measure two main factors: long-horizon performance, or the ability to simulate a coherent world over time, and agentic usability, or how reliably an agent can operate inside the simulation. If something is wrong within a world model, the agent working within it goes haywire.
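As a rough illustration of the first metric – our sketch, not MBZUAI’s benchmark code, with a made-up model/environment interface – long-horizon performance can be framed as how slowly the model’s imagined trajectory drifts from what actually happens:

    import numpy as np

    def long_horizon_score(model, env, horizon: int = 500) -> float:
        """Roll the model forward alongside the real environment and measure
        how quickly its imagined states drift from reality (lower is better).
        Assumes states are numpy vectors; all interfaces here are hypothetical."""
        predicted = model.reset(env.observe())
        drift = []
        for _ in range(horizon):
            action = env.sample_action()
            predicted = model.step(action)   # imagined next state
            actual = env.step(action)        # what really happens
            drift.append(float(np.linalg.norm(predicted - actual)))
        return sum(drift) / horizon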

The next step in PAN’s development is to make the model’s “imagination space,” or inner visualization capabilities, richer and more precise, allowing it to understand and render worlds in even finer detail. MBZUAI is also expanding beyond vision alone, researching modalities such as sound and motion signals, as well as using an agent to test and learn from different scenarios.

“That’s how we move from a model that only imagines the world to one that can actually think and act within it,” said Liu.

TOGETHER WITH TEMPORAL

62% of teams lose time or revenue to reliability troubles

It doesn’t have to be this way, though.

It’s time to create AI applications that go beyond prototypes and toys to work in the real world.

Easily build ambient and human-in-the-loop agents, and ensure your systems survive reliability challenges with Temporal.

We handle the tricky parts of AI systems so you don't have to.

Get started with Temporal today — build reliable AI systems in minutes

RESEARCH

The daunting task of simulating reality

Though several developers want to build models that see the world for what it is, these systems are still in very early stages. 

Progress has been made on visual understanding, but humans have more than one sense. To be truly complete, a world model needs a strong understanding of audio, touch and physical interaction – and the ideal system can not only understand all of those modalities but also create simulations in any of them. “If a modality is missing, the simulation will always be incomplete,” said Liu.

Creating an AI that understands all of those modalities means creating a model that senses the world almost as fully as a human does. But doing so comes with significant technical barriers, including access to substantial amounts of complex training data and potentially the need for entirely new model architectures.

But surpassing those barriers could have far-reaching implications, said Liu.

In robotics, these models could reduce the need for intensive monitoring and training by limiting “real-world trial and error,” Liu said. Instead, the models that operate robots could be trained in simulation, perfecting actions and discovering mistakes before they ever reach factory floors or homes. In self-driving cars, meanwhile, a world model could let an autonomous vehicle rehearse thousands of traffic scenarios before the rubber hits the road.
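The robotics workflow Liu describes resembles what the research literature calls training in imagination (the pattern behind models like DeepMind’s Dreamer). Here’s a hedged sketch with an invented interface – imagine_step, record and the rest are illustrative names, not a real API:

    def train_in_imagination(world_model, policy, episodes: int = 1000, horizon: int = 50):
        """Practice inside the world model's simulated rollouts, so mistakes
        happen virtually rather than on a factory floor."""
        for _ in range(episodes):
            state = world_model.sample_initial_state()
            for _ in range(horizon):
                action = policy.act(state)
                state, reward = world_model.imagine_step(state, action)  # no real robot
                policy.record(state, action, reward)
            policy.update()  # learn from imagined outcomes
        return policy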

And the possibilities extend beyond the self-piloted machines available today, with research underway in domains such as sports strategy, to simulate player outcomes, and animation and digital art, to design and create worlds, said Liu. More discoveries could emerge once these models are actually in people’s hands.

“In the end, it’s about creating AI that doesn’t just react to the world but can think ahead,” said Liu.

LINKS

  • OpenAI GPT-5.1: The latest upgrade to OpenAI’s flagship model, including both “instant” and “thinking” versions.

  • Multifactor: Zero-trust authentication, authorization and auditing for agentic AI.

  • Magic Patterns: AI design tool that gives you a prototype within minutes.

  • Code Arena: Code evaluations to test how frontier models plan, build and debug web apps.

  • Webflow App Gen: A system to build production-grade web apps with AI that incorporate your brand, content and vision.

  • ElevenLabs Iconic Marketplace: A platform for licensing popular and legendary voices for AI.

  • ServiceNow: Staff Research Engineer/Scientist

  • AMD: Applied ML researcher, Generative AI - Advanced Graphics Programs

  • Roblox: Senior Machine Learning Engineer, Ads

  • Waymo: Research Scientist, Prediction & Planning 

GAMES

Which image is real?


A QUICK POLL BEFORE YOU GO

Can AI truly understand the world without a physical body?


The Deep View is written by Nat Rubio-Licht, Faris Kojok and The Deep View crew. Please reply with any feedback.

Thanks for reading today’s edition of The Deep View! We’ll see you in the next one.

“The near-perfect condition of the sign, on the ragged pole, was the telltale sign”

“The sign is more universal - not male or female plus the wear and tear at the bottom of the sign is hard to recreate in AI. [The other image’s] rider and the galloping horse on a post in the middle of no where means nothing.”

“The horses feet were level to the ground”

“It must be the hat?”

“At first, I thought it was [the other image] because the rust around the bolts looked real, but I could see a white line at the top R hand side coming off the post that looked off.”

“The horse and cowboy look realistic in the image”

Take The Deep View with you on the go! We’ve got exclusive, in-depth interviews for you on The Deep View: Conversations podcast every Tuesday morning.

If you want to get in front of an audience of 550,000+ developers, business leaders and tech enthusiasts, get in touch with us here.