Physical AI goes from words to worlds

Welcome back. The robots are taking over. Or, at least, they’ll be able to fold our laundry and do our dishes once they learn how to move their fingers properly. The physical AI market is rapidly accelerating as developers and researchers seek new ways to make AI that sees the world as humans do. But as it turns out, making a robot that actually works isn’t an easy feat.
1. Physical AI goes from words to worlds
2. Robots have a data problem
3. Why AGI might need a body
RESEARCH
Physical AI goes from words to worlds

Physical AI might be on the verge of a ChatGPT moment.
Robotics, vision-language-action models that turn language into action, and the buzzy, emerging concept of world models have caught the attention of Big Tech, startups and investors alike. Though developing these models comes with a unique set of challenges, physical AI is in a “GPT-1 or GPT-2 phase,” Evan Helda, head of physical AI at Nebius, told The Deep View.
“You have the early signs of a general purpose model, which is like the ChatGPT moment,” Helda told me at the Nebius Robotics and Physical AI Summit on Tuesday. “We’re in that 2018, 2019 period for LLMs. We’re not there yet, but people aren’t as caught off guard now. People are trying to get ahead, versus playing catch up.”
According to a report by Crunchbase and tech nonprofit Mind the Bridge released in September, physical AI firms raked in $16.1 billion in the first three quarters of 2025. Some firms have seen standout investments since then, including Figure AI’s $1 billion Series C at a valuation of $39 billion; Physical Intelligence’s $600 million round, propelling it to a $5.6 billion valuation; and Jeff Bezos’ Project Prometheus, which emerged from stealth with $6.2 billion in funding.
Beyond the numbers, physical AI deployments run the gamut from medical deliveries to piloting helicopters during wildfires to weeding fields. Some of AI’s most prominent researchers have shifted their focus from language models to world models, systems meant to comprehend the world the way humans do.
It’s proof that AI is moving far beyond the “digital domain,” Amit Goel, head of robotics and edge computing ecosystem at Nvidia, said in his keynote on Tuesday.
Constructing these systems is no easy feat. Researchers are starting to create models that understand modalities beyond language and vision, but space, time, force and touch are all concepts that a physical AI system needs to grasp, said Goel.
“Inherently, physical AI has to be multimodal,” Goel said. “We have to understand all the different modalities in order to make intelligent decisions.”
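To make that multimodality concrete, here is a minimal sketch in Python of the kind of fused observation a robot policy has to reason over, versus the text-only input of a chatbot. Every field name and shape below is hypothetical, invented for illustration rather than drawn from Nvidia’s stack or any particular robotics framework.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RobotObservation:
    """One timestep of multimodal sensor input (hypothetical schema)."""
    rgb: np.ndarray                # camera frame, e.g. (H, W, 3) uint8
    depth: np.ndarray              # per-pixel depth in meters, (H, W)
    joint_positions: np.ndarray    # proprioception: joint angles in radians
    joint_torques: np.ndarray      # force sensing at each joint
    fingertip_contact: np.ndarray  # binary touch signals per fingertip
    timestamp: float               # time, so the policy can reason about motion

def flatten_observation(obs: RobotObservation) -> np.ndarray:
    """Naive fusion: concatenate all modalities into one feature vector.
    Real systems use learned per-modality encoders instead."""
    parts = [
        obs.rgb.astype(np.float32).ravel() / 255.0,
        obs.depth.ravel(),
        obs.joint_positions,
        obs.joint_torques,
        obs.fingertip_contact.astype(np.float32),
        np.array([obs.timestamp]),
    ]
    return np.concatenate(parts)

# Example: a tiny fake observation
obs = RobotObservation(
    rgb=np.zeros((4, 4, 3), dtype=np.uint8),
    depth=np.ones((4, 4)),
    joint_positions=np.zeros(7),
    joint_torques=np.zeros(7),
    fingertip_contact=np.zeros(5, dtype=bool),
    timestamp=0.02,
)
print(flatten_observation(obs).shape)  # (84,) for this toy example
```

In practice, each modality typically gets its own learned encoder rather than naive concatenation, which is part of what makes these systems harder to build than text-only models.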
Still, Nebius’ San Jose summit showcased startups working on physical AI capabilities across the stack, from simulation to foundation models to visual intelligence to actual robotic deployment. With researchers attacking these problems from all angles, the industry is in for “a lot of progress in parallel,” said Brad Porter, co-founder and CEO of Cobot.
“There are enough people now working on it that I think we’re going to see really big breakthroughs in the next few years,” Porter said in a fireside chat on Tuesday. “It just might not be through brute force scaling like LLMs had.”
TOGETHER WITH YOU.COM
Lack of training is one of the biggest reasons AI adoption stalls. This AI Training Checklist from You.com highlights common pitfalls to avoid and necessary steps to take when building a capable, confident team that can make the most out of your AI investment.
What you'll get:
Key steps for building a successful AI training program
Guidance on fostering adoption
A structured worksheet to monitor progress across your organization
Set your AI initiatives on the right track. Get the checklist.
PRODUCTS
Robots have a data problem

Though tech firms have lofty aspirations for a robotic future where general-purpose humanoids walk among us, actually making physical AI systems that work is far from simple.
From the foundation these machines are built on down to the dexterity of their robotic fingers, building physical intelligence systems comes with sets of problems that large language model development doesn’t face. To accurately perceive and act on the world around them, these models need to comprehend their surroundings in a way that a chatbot can’t.
“Building a robot is not easy. I often say it’s like raising a kid. You need a village to do that,” Nvidia’s Goel said in his Tuesday keynote.
One of the foundational rules of AI is that a model is only as good as the data it’s trained on. But that is precisely the problem for physical models: getting good data involves more than just scraping the internet. For the models to go beyond words, the data needs to go beyond words, too.
It’s a problem Ken Goldberg, a UC Berkeley roboticist, has dubbed the 100,000-year data gap: the amount of training data needed for general-purpose humanoid robotics vastly exceeds the training needs of modern LLMs. “There’s not enough real-world data out there,” Nebius’ Helda told The Deep View.
While it’s possible for humans to capture real-world data, Helda noted, that approach only scales as fast as the humans in the loop.
And though deploying robots in the real world can generate a wealth of data, that data only goes as far as the robots that can be deployed, TJ Galda, senior director of product management for Nvidia’s Cosmos world models, told The Deep View. “If you build a brand new robot and you’ve never deployed it, you’ve got zero video data, zero lidar.”
Additionally, even if a robot or physical AI system only needs to be trained on a specific task, these models still require a vast amount of pretraining data to scale, Lindon Gao, CEO of Dyna Robotics, said during a panel at the Nebius event. “We need to scale pretraining data sets much more drastically so we can cover a wider distribution of data across different types of tasks.”
It’s why synthetic data, generated in world models to mimic real-world conditions, might be key to closing this gap, said Galda. In addition to producing large datasets to train self-driving cars and other autonomous systems, these models can simulate rare conditions to better prepare systems for anomalies, he said.
For example, while a dash cam might capture a video of a bear running across the road once, a simulation in a world model could replicate this action hundreds or thousands of times, he said.
“Can we build synthetic data that’s good enough to train on that’s just as good as if you hit record on a camera? Because if we can, then we can scale that problem very quickly,” said Galda.
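As a toy illustration of that replication, here is a hedged Python sketch in the spirit of domain randomization: one recorded rare event is perturbed into many synthetic variants. The scenario fields and parameter ranges are entirely hypothetical, a sketch of the technique rather than how Nvidia’s Cosmos pipeline actually works.

```python
import random
from dataclasses import dataclass, replace

@dataclass
class BearCrossingScenario:
    """Parameters of one synthetic 'bear crosses the road' clip (hypothetical)."""
    bear_speed_mps: float      # how fast the animal moves
    crossing_angle_deg: float  # direction relative to the lane
    distance_m: float          # distance from the ego vehicle at first sight
    time_of_day_h: float       # lighting-condition proxy (0-24h)
    rain_intensity: float      # 0.0 = clear, 1.0 = downpour

def randomize(base: BearCrossingScenario, rng: random.Random) -> BearCrossingScenario:
    """Re-sample conditions around the recorded event so each clip differs."""
    return replace(
        base,
        bear_speed_mps=base.bear_speed_mps * rng.uniform(0.5, 2.0),
        crossing_angle_deg=base.crossing_angle_deg + rng.uniform(-30, 30),
        distance_m=base.distance_m * rng.uniform(0.6, 1.5),
        time_of_day_h=rng.uniform(0, 24),
        rain_intensity=rng.uniform(0.0, 1.0),
    )

# One real dash-cam event becomes thousands of varied training scenarios.
rng = random.Random(42)
recorded = BearCrossingScenario(2.5, 90.0, 40.0, 14.0, 0.0)
synthetic_dataset = [randomize(recorded, rng) for _ in range(1000)]
print(len(synthetic_dataset), synthetic_dataset[0])
```

A real pipeline would render each scenario into video and sensor streams; the point here is only that one recorded anomaly can seed an arbitrarily large, varied training set.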
TOGETHER WITH YOU.COM
AI implementation can go sideways due to unclear goals and lack of skills. Ensure your team is ready to harness the full potential of your AI investment with this AI Training Checklist from You.com. Set your team—and your AI initiatives—up for success in the new year.
HARDWARE
Why AGI might need a body

While a lot of money, time and effort have been poured into making large language models bigger and better, machines that are good with words are just one ingredient in a larger recipe.
LLMs play an important “low-level function” in physical AI and robotics, Helda told The Deep View. Words are how humans can functionally communicate with these systems, and language models can translate those words into action. But human understanding spans far beyond language, he noted.
Major AI researchers like Yann LeCun, Fei-Fei Li and Ilya Sutskever have questioned the idea that scaling large language models will lead to the idealistic vision of artificial general intelligence, and some have turned their attention to world models and spatial intelligence as the next frontier.
“How do you really have agency in the world and be truly intelligent – know what you’re doing and why you’re doing it – unless you have a sense of consequence? A sense of physics?” said Helda. “Until you can understand physics and movement and cause and effect, I don’t think you can be truly super intelligent. And LLMs don’t have that.”
Giving these machines an embodiment might help them exist beyond just being a “brain in a can,” as RLWRLD CEO Jung-hee Ryu told me last week at the AWS re:Invent conference in Las Vegas. “If we decide to build artificial general intelligence or artificial super intelligence, we should give it a body, an embodiment, to feel real-world situations and sensory information.”
However, while the humanoid form factor comes with the benefit of versatility, “The technology is not there yet. It’s coming, but it’s not there,” Jonathan Hurst, co-founder of Agility Robotics, said on a panel at the Nebius event. “It’s going to be a while before you have this generally capable humanoid that can operate in all of those spaces.”
While large language models will always be a piece of the puzzle, the question remains whether the industry has put too much faith in their capabilities. Rumblings of an AI bubble have emerged in recent months as LLM developers pour upwards of a trillion dollars into AI infrastructure with little return to show for it; the nascent physical intelligence market will also be able to leverage that infrastructure. As Hugging Face CEO Clem Delangue put it, we may be in an LLM bubble, rather than an AI bubble broadly.
“You can keep getting better and better LLMs, perhaps, but certainly it's reliant on them being used for something,” Thomas Randall, research director at Info-Tech Research Group, told The Deep View. “There's this race of innovation, but that's now way ahead of how far users can keep up with it.”
LINKS

Video communication firm Zoom is now a frontier AI company
Broadcom reveals Anthropic is its mystery $10 billion customer
ChatGPT “adult mode” to roll out early next year, once age prediction is ready

Tinker: Thinking Machines has made Tinker, its API for fine-tuning language models, generally available.
Gemini Audio Updates: Google has added speech-to-speech translation, improved text-to-speech capabilities and complex workflow improvements to its flagship models.

A POLL BEFORE YOU GO
Is physical AI the next ChatGPT?
The Deep View is written by Nat Rubio-Licht, Jack Kubinec, Jason Hiner, Faris Kojok and The Deep View crew. Please reply with any feedback!

Thanks for reading today’s edition of The Deep View! We’ll see you in the next one.

*Editor’s note: Turns out we were the ones who couldn't tell Fake from Real last time. Corrected version above!

Take The Deep View with you on the go! We’ve got exclusive, in-depth interviews for you on The Deep View: Conversations podcast every Tuesday morning.

If you want to get in front of an audience of 450,000+ developers, business leaders and tech enthusiasts, get in touch with us here.
