The race to give AI a body

Nat Rubio-Licht

The robots are doing backflips

The robots are forming their hands into fists. They’re grasping pencils and writing. They’re packing boxes and moving things from one place to another. They’re folding laundry, doing dishes, and organizing rooms. They’re doing kung fu and dancing about as awkwardly as a teenage boy at a middle school dance. 

While we’re nowhere near these machines looking like Sophie Thatcher in Companion, more and more, they are taking our shape, moving the way we do and performing our tasks. And with advances in AI stoking expectations, humanoids have become one of the most sought-after form factors among roboticists and AI builders. 

But with non-humanoid systems already deployed everywhere from food delivery to warehouses to hospitals, you might be asking yourself the same question I keep circling back to: Why do robots need to have two legs, two arms and a head? What is the point of making them look like us? 

Largely, the goal is to create the everything machine: A robot with a generalist brain that can take on any task, enter any environment, or interact with anything in the same way a human could. In that way, the race to build a working humanoid robot mirrors the race to build AGI and superintelligence. 

However, humanoids may just be the beginning. Experts in the space believe that humanoids could be the shape that unlocks a broader platform shift, allowing developers to build applications for a unified form factor. That, in turn, could broaden the horizons for physical AI, a category that Nvidia CEO Jensen Huang has repeatedly said is due for its “ChatGPT moment.” 

In an interview with The Deep View at Nvidia’s annual GTC conference, Amit Goel, the company’s head of robotics and edge computing ecosystem, compared the opportunity of humanoids to that of smartphones. When they first hit the market, millions of developers emerged to create applications, because smartphones offered “a platform on which all of these things could land.” 

“That has not been possible for robotics, because every robot looks different. Everybody's form factor is different,” said Goel. “But with the humanoid form factor, there is an opportunity to do that platform shift, where we can have millions of developers all over the world just building applications.” 

A world made for humans

It’s evident that investors see significant promise in the humanoid industry, with several startups raising millions in funding and exceeding the billion-dollar mark in valuation over the past year.  

In September, Figure AI raised more than $1 billion in Series C funding from the likes of Nvidia, Intel, Salesforce and LG, bringing its valuation to $39 billion, to fund its robots capable of packing orders, building cars or cleaning your living room. Meanwhile, Sunday Robotics, another household humanoid company, closed a $165 million Series B funding round in March, scoring a valuation of $1.15 billion. 

Apptronik, which aims to build training factories for its humanoid robots for retail, manufacturing, and logistics settings, raised $935 million between two Series A rounds as of February, hitting a valuation of $5 billion. And Nvidia-backed Skild AI raised $1.4 billion at a $14 billion valuation in January, to aid in its quest to build a “single, general-purpose brain” for any robot and any task. 

Big tech, of course, is carving out its own piece of the market. Google DeepMind launched a research partnership with Agile Robots in mid-March to build its Gemini models into “helpful and useful” humanoids. Nvidia, meanwhile, announced partnerships with robotics companies and “humanoid pioneers” to build machines using its suite of physical AI models. And Tesla has long kept itself busy with Optimus, betting that its humanoid project will be “the biggest product ever made,” as it stated in a recent post on X. 

“Without question, robotics and physical AI are the largest opportunities for humanity, because the world is so physical,” Deepu Talla, VP of robotics and edge AI at Nvidia, told The Deep View. “Within that, the form factor of a humanoid will be the largest opportunity.” 

Why? Because the world is built for humans, said Talla. Every door is meant for humans to pass through, every chair and couch for us to sit on, every handle for us to turn and every button for us to press. We can’t expect a non-humanoid robot to navigate the world without the proper equipment. Even delivery robots struggle to perform their functions in the face of particularly daunting curbs or crosswalk buttons. 

“[Humanoids] are actually the best form factor to start off with. We built the world around us for centuries,” said Talla. “[It] all has been designed for human heights and human weights, human proportions.” 

And because the world is built for us, the data that AI systems rely on lives all around us, said Goel. Though every environment is different, from retail to hospitals to factories, the experience of navigating through the world in a human body is a shared reality we all have to maneuver. “We have eight billion people doing stuff,” said Goel. “That's a rich data source.” 

The dexterity gap

The developers of these machines are keen to show off that these robots have skills, whether they're playing tennis, clearing plates after dinner, or putting on boxing gloves and tussling with The Deep View’s editor-in-chief, Jason Hiner. 

At Nvidia GTC, I got a hands-on demo from IntBot, a robotics company building “social intelligence” for robots, aiming to deploy humanoids with a “human touch” in hotels and airports, Lei Yang, the startup’s CEO, told The Deep View. 

Sitting at the front desk of the San Jose Convention Center, an IntBot-trained humanoid wearing an Nvidia green neckerchief waved its left hand when I commanded it to. Though the robot was unable to shake my hand because it was “not quite that stable” on its feet, it did its level best to give me a thumbs-up with its right hand. 

“We envision in the future, a robot will be in everyone's home, in every workplace,” Yang told me. “It's not a tool, it's basically a member of your family. But for that to happen, the robot needs to understand, perceive, reason and respond like a human being.” 

And in that future, humanoids may come in all different shapes and sizes, Talla said. Once the technology has reached maturity, humanoids may range from tiny, plush robots for children to adult-sized machines that can act as companions, he said. 

For instance, at GTC, Nvidia’s Huang wrapped his keynote by calling a robotic version of Olaf from the Disney movie Frozen on stage with him. Though far from looking human, the Olaf-bot had the essential pieces that make it humanoid: Two legs, two arms and a head. 

The point of these demonstrations isn’t just to put on a good show for rubbernecking conference enthusiasts. It’s to show off what these machines could be capable of in the right scenarios. If a robot can wave its hand and point left and right, its hand dexterity may be improving enough to screw on bottle caps or package delicate items. If a robot can do a backflip and land on its feet, its leg locomotion is becoming sturdy enough to work 24/7 in a factory. If a robot can fight or react to a tennis ball hurtling its way, it may be able to react to unforeseen situations. If Olaf can walk on stage and chop it up with Jensen Huang, it may eventually be deployable in Disneyland.

And in IntBot’s case, if a robot can make tech-savvy GTC guests smile at the front desk of a conference center, Yang believes these robots might eventually stand a chance against the exhausted airport traveler or the frustrated hotel guest. IntBot is already testing the waters, deploying humanoids in a few hotels in New York and Las Vegas, and launching an exhibition in the San Jose Mineta International Airport in late March. 

“We need the humanoid form factor for human-robot interaction,” said Yang. 

The more humanoids show off, the more expectations rise. And yet, there's still a long journey from here to a time when these robots become a fixture in everyday life, at work and at home. 

As humans, we take for granted how easy it is for us to move our hands, turn our heads or tap our feet, Talla told me. We pick up items and open doors without thinking about how much pressure to apply. We know to grip an egg delicately enough not to break it, and to hold a pull-up bar firmly enough that we don’t lose our grip and fall. 

A robot, however, doesn’t. A humanoid brain needs to be trained on the intricacies of human touch, including all the things that millions of years of evolution have ingrained in us from birth. To put it plainly: Teaching a robot how to do all of these things is really, really hard. 

“I think one big challenge is just how little time it has had to mature,” said Goel. “Something that complex, with so many joints, so many sensors, so much compute, it has to go through a little bit of an iteration cycle before it becomes something that is really robust and reliable and mature.” 

The cost of human mimicry 

Even though, theoretically, a wealth of data lives at our fingertips, actually accessing it is a whole other ballgame. Ken Goldberg, a professor at UC Berkeley, dubbed the issue the “100,000-year data gap” in a paper he published in August. 

Usable, real-world data is sparse compared to what language models draw on, and can only be bolstered by simulations or by deploying more robots into real-world situations. Safety, meanwhile, can hinder deployment at a significant scale, and is especially challenging in “long-tail, unseen environments,” said Goel. “That’s where the robot has not seen it, [so it] doesn't know what to do.” 

Because of this, expectations may be moving faster than the technology can keep up, Goldberg argues. “We’re just trying to reset expectations so that it doesn’t create a bubble that could lead to a big backlash,” Goldberg told UC Berkeley News. 

And if or when these robots are ever truly ready to walk among us, winning societal acceptance could prove an even greater challenge. How are you meant to treat a humanoid cashier, housekeeper or coworker? Are the stakes the same when soldiers are fighting side by side with machines, or fighting against them?   

There’s a lot of talk of white-collar job loss as a result of AI. But with humanoids, those losses may bleed over to hands-on work. Put yourself in the shoes of a fast food worker, a warehouse employee or a factory worker a decade from now. Imagine the loss of dignity you might feel not only being replaced by an AI-powered system, but that system being a humanoid robot that is a vague, faceless approximation of a human being. 

And of course, when you consider this scenario, it’s only natural to arrive back at the tech industry’s quest to develop human-level, artificial general intelligence. While there is little consensus on how to achieve this hazy, ill-defined goal, many of AI’s foremost thinkers have shifted away from the idea that it can be achieved by scaling large language models alone. 

That's why the likes of Yann LeCun and Fei-Fei Li have turned their attention to spatial intelligence and world models. The purpose of these quests is far greater than their individual use cases. Li and LeCun have both said that LLMs are hitting limits and won't reach human-level AI on their current trajectory. They argue that world models will teach AI to operate the way humans and animals do as they move through the world.

In both AGI and humanoids, the goal is singular: To create a machine that can learn, perform and understand anything as well as humans do. 

It’s this line of thinking that reminds me of a conversation I had at AWS Re:Invent in December with Jung-hee Ryu, the founder of RLWRLD, a company specializing in fine motor skills and dexterity for humanoid robots. Ryu told me that for AI to achieve human-level capabilities, it needs a body. Otherwise, it’s just a “brain in a can.” 

Humanoids are the physical embodiment of the tech industry’s quest to build life itself. Depending on your point of view, that's an incredibly noble ambition, or a remarkably arrogant one. Racing uninhibited toward a future that holds Terminator-like disastrous outcomes in one hand and an end to some of humanity's most enduring problems in the other is nothing if not human in its hubris. The eventual outcome likely depends entirely on how well we plan for it.
