A one-armed robot stood in front of a table. On the table sat three plastic figurines: a lion, a whale and a dinosaur.
An engineer gave the robot an instruction: “Pick up the extinct animal.”
The robot whirred for a moment, then its arm extended and its claw opened and descended. It grabbed the dinosaur.
Until very recently, this demonstration, which I witnessed during a podcast interview at Google’s robotics division in Mountain View, Calif., last week, would have been impossible. Robots weren’t able to reliably manipulate objects they had never seen before, and they certainly weren’t capable of making the logical leap from “extinct animal” to “plastic dinosaur.”
Google’s robot being prompted to pick up the extinct animal. Credit: Video by Kelsey McClellan for The New York Times
But a quiet revolution is underway in robotics, one that piggybacks on recent advances in so-called large language models — the same type of artificial intelligence system that powers ChatGPT, Bard and other chatbots.
Google has recently begun plugging state-of-the-art language models into its robots, giving them the equivalent of artificial brains. The secretive project has made the robots far smarter and given them new powers of understanding and problem-solving.
I got a glimpse of that progress during a private demonstration of Google’s latest robotics model, called RT-2. The model, which is being unveiled on Friday, amounts to a first step toward what Google executives described as a major leap in the way robots are built and programmed.
“We’ve had to reconsider our entire research program as a result of this change,” said Vincent Vanhoucke, Google DeepMind’s head of robotics. “A lot of the things that we were working on before have been entirely invalidated.”
Robots still fall short of human-level dexterity and fail at some basic tasks, but Google’s use of A.I. language models to give robots new skills of reasoning and improvisation represents a promising breakthrough, said Ken Goldberg, a robotics professor at the University of California, Berkeley.
“What’s very impressive is how it links semantics with robots,” he said. “That’s very exciting for robotics.”
To understand the magnitude of this, it helps to know a little about how robots have conventionally been built.
For years, the way engineers at Google and other companies trained robots to do a mechanical task — flipping a burger, for example — was by programming them with a specific list of instructions. (Lower the spatula 6.5 inches, slide it forward until it encounters resistance, raise it 4.2 inches, rotate it 180 degrees, and so on.) Robots would then practice the task again and again, with engineers tweaking the instructions each time until they got it right.
This approach worked for certain, limited uses. But training robots this way is slow and labor-intensive. It requires collecting lots of data from real-world tests. And if you wanted to teach a robot to do something new — to flip a pancake instead of a burger, say — you usually had to reprogram it from scratch.
Partly because of these limitations, hardware robots have improved less quickly than their software-based siblings. OpenAI, the maker of ChatGPT, disbanded its robotics team in 2021, citing slow progress and a lack of high-quality training data. In 2017, Google’s parent company, Alphabet, sold Boston Dynamics, a robotics company it had acquired, to the Japanese tech conglomerate SoftBank. (Boston Dynamics is now owned by Hyundai and seems to exist mainly to produce viral videos of humanoid robots performing terrifying feats of agility.)
In recent years, researchers at Google had an idea. What if, instead of being programmed for specific tasks one by one, robots could use an A.I. language model — one that had been trained on vast swaths of internet text — to learn new skills for themselves?
“We started playing with these language models around two years ago, and then we realized that they have a lot of knowledge in them,” said Karol Hausman, a Google research scientist. “So we started connecting them to robots.”
Google’s first attempt to join language models and physical robots was a research project called PaLM-SayCan, which was revealed last year. It drew some attention, but its usefulness was limited. The robots lacked the ability to interpret images — a crucial skill, if you want them to be able to navigate the world. They could write out step-by-step instructions for different tasks, but they couldn’t turn those steps into actions.
Google’s new robotics model, RT-2, can do just that. It’s what the company calls a “vision-language-action” model, or an A.I. system that has the ability not just to see and analyze the world around it, but to tell a robot how to move.
It does so by translating the robot’s movements into a series of numbers — a process called tokenizing — and incorporating those tokens into the same training data as the language model. Eventually, just as ChatGPT or Bard learns to guess what words should come next in a poem or a history essay, RT-2 can learn to guess how a robot’s arm should move to pick up a ball or throw an empty soda can into the recycling bin.
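To make the tokenizing idea concrete, here is a minimal sketch of one way a robot movement could be turned into a series of numbers and back again. It is an illustration, not Google's implementation: the bin count, the value range, the four-part arm command and the function names are all assumptions made for this example.

```python
from typing import List, Sequence

NUM_BINS = 256          # assumed number of tokens per movement dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized range for every dimension


def tokenize_action(action: Sequence[float]) -> List[int]:
    """Map each continuous value to an integer token in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)        # keep the value in range
        fraction = (clipped - LOW) / (HIGH - LOW)   # rescale to [0, 1]
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def detokenize_action(tokens: Sequence[int]) -> List[float]:
    """Map each token back to the center of its bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (token + 0.5) * width for token in tokens]


# Hypothetical arm command: x, y, z displacement plus gripper opening.
command = [0.10, -0.25, 0.0, 1.0]
tokens = tokenize_action(command)       # [140, 96, 128, 255]
recovered = detokenize_action(tokens)   # approximately the original command
```

Once movements are plain integers like these, they can sit in the same training sequences as word tokens, which is what lets a language model predict the next movement the way it predicts the next word. The trade-off is precision: each recovered value can differ from the original by up to half a bin width.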
“In other words, this model can learn to speak robot,” Mr. Hausman said.
In an hourlong demonstration, which took place in a Google office kitchen littered with objects from a dollar store, my podcast co-host and I saw RT-2 perform a number of impressive tasks. One was successfully following complex instructions like “move the Volkswagen to the German flag,” which RT-2 did by finding and snagging a model VW Bus and setting it down on a miniature German flag several feet away.
It also proved capable of following instructions in languages other than English, and even making abstract connections between related concepts. Once, when I wanted RT-2 to pick up a soccer ball, I instructed it to “pick up Lionel Messi.” RT-2 got it right on the first try.
The robot wasn’t perfect. It incorrectly identified the flavor of a can of LaCroix placed on the table in front of it. (The can was lemon; RT-2 guessed orange.) Another time, when it was asked what kind of fruit was on a table, the robot simply answered “white.” (It was a banana.) A Google spokeswoman said the robot had used a cached answer to a previous tester’s question because its Wi-Fi had briefly gone out.
Google has no immediate plans to sell RT-2 robots or release them more widely, but its researchers believe these new language-equipped machines will eventually be useful for more than just parlor tricks. Robots with built-in language models could be put into warehouses, used in medicine or even deployed as household assistants — folding laundry, unloading the dishwasher, picking up around the house, they said.
“This really opens up using robots in environments where people are,” Mr. Vanhoucke said. “In office environments, in home environments, in all the places where there are a lot of physical tasks that need to be done.”
Of course, moving objects around in the messy, chaotic physical world is harder than doing it in a controlled lab. And given that A.I. language models frequently make mistakes or invent nonsensical answers — which researchers call hallucination or confabulation — using them as the brains of robots could introduce new risks.
But Mr. Goldberg, the Berkeley robotics professor, said those risks were still remote.
“We’re not talking about letting these things run loose,” he said. “In these lab environments, they’re just trying to push some objects around on a table.”
Google, for its part, said RT-2 was equipped with plenty of safety features. In addition to a big red button on the back of every robot — which stops the robot in its tracks when pressed — the system uses sensors to avoid bumping into people or objects.
The A.I. software built into RT-2 has its own safeguards, which it can use to prevent the robot from doing anything harmful. One benign example: Google’s robots can be trained not to pick up containers with water in them, because water can damage their hardware if it spills.
If you’re the kind of person who worries about A.I. going rogue — and Hollywood has given us plenty of reasons to fear that scenario, from the original “Terminator” to last year’s “M3gan” — the idea of making robots that can reason, plan and improvise on the fly probably strikes you as a terrible idea.
But at Google, it’s the kind of idea researchers are celebrating. After years in the wilderness, hardware robots are back — and they have their chatbot brains to thank.