How AI could humanise robots
- Large language models (LLMs) and vision-language models will have a major impact on the future of robotics.
- Robots can now communicate in natural language, break tasks down into steps and reason using images.
- However, LLMs do not yet effectively enable robots to manipulate objects with their hands or interact with a 3D environment.
- There is potential for developing robotics using generative AI, such as enabling robots to reason in video and in action.
Watching videos released by robotics companies like Tesla and Figure, it could seem like robots will walk into our homes tomorrow, able to execute any command a human gives them thanks to advancements in large language models (LLMs). That may be coming down the pike, but there are some substantial hurdles to overcome first, says Edward Johns, director of the Robot Learning Lab at Imperial College London.
We have seen stratospheric advances in the field of large language models. Is that going to propel robotics forward?
Edward Johns. What has happened with large neural networks like language models and vision-language models will have a big impact on robotics — it’s already helping with some of the challenges we’ve had. But we’re certainly not going to see a ChatGPT-like moment in robotics overnight.
LLMs enable operators to use natural language when communicating with the robot, rather than inputting code. That’s useful because, ultimately, that’s how we want humans to interact with them. More importantly, these models can unlock a new way of reasoning for robots. ChatGPT, for instance, can break tasks down into steps. If you ask it how to make a sandwich, it will say: you need bread, so you need to buy bread, which means finding a shop, getting your wallet, leaving the house, and so on. That means robots can learn to break down tasks internally, and we know they perform better when they have a step-by-step guide.
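To make the idea concrete, here is a minimal sketch of how a robot’s control software might ask a language model for a step-by-step plan and parse the reply. The `query_llm` helper is hypothetical and returns a canned answer here so the sketch runs end to end; in practice it would wrap whichever LLM API is available.

```python
# Illustrative sketch: using an LLM to break a household task into steps a
# robot planner could consume. `query_llm` is a hypothetical stand-in.

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real language-model call.
    Returns a canned reply so the sketch runs without a real model."""
    return (
        "1. Find a shop that sells bread\n"
        "2. Take your wallet and leave the house\n"
        "3. Buy bread\n"
        "4. Butter the bread and add a filling"
    )

def decompose_task(task: str) -> list[str]:
    """Ask the model for a numbered plan and parse it into a list of steps."""
    prompt = (
        f"Break the task '{task}' into short, numbered steps "
        "that a household robot could follow, one step per line."
    )
    steps = []
    for line in query_llm(prompt).splitlines():
        line = line.strip()
        if line and line[0].isdigit():
            # Drop the leading "1.", "2." etc., keeping only the instruction.
            steps.append(line.split(".", 1)[-1].strip())
    return steps

print(decompose_task("make a sandwich"))
# ['Find a shop that sells bread', 'Take your wallet and leave the house', ...]
```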
Over the past few months, we’ve also seen the emergence of so-called “vision-language models” that allow the robot to reason not only in language but with images. That’s important because, at some point, the robots need to add visual information to their reasoning to navigate their environment.
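In the same hedged spirit, a vision-language model can be queried with an image as well as text, so the plan can be grounded in what the camera actually sees. The `query_vlm` helper below is hypothetical and returns a canned answer in place of a real model call.

```python
# Illustrative sketch: grounding the plan in the robot's camera view.

def query_vlm(image_bytes: bytes, question: str) -> str:
    """Hypothetical stand-in for a vision-language-model call; returns a
    canned answer so the sketch runs without a real model."""
    return "the mug and the glass"

# In a real system this would be the current frame from the robot's camera.
camera_frame = b"<jpeg bytes from the workspace camera>"

answer = query_vlm(camera_frame, "Which objects on the table could hold water?")
print(answer)  # fed back into the step-by-step plan from the language model
```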
What, then, is the limit to using LLMs for robots?
While these are interesting models to probe, they are solving some of the easier challenges in robotics. They have not had a huge impact in terms of dextrous manipulation, for instance — manipulation with hands. That’s really what robotics is still missing, and it is really difficult. Our hands do thousands and thousands of complex tasks every day.
One problem is that these vision-language models are very good semantically, but they won’t be able to help the robot interact with a 3D environment, because they are trained on 2D images. For robots to be able to reason on that level, they need a huge amount of robotics data, which just doesn’t exist. Some people think this will happen very quickly, like the flashpoint we have had since the emergence of ChatGPT — that’s certainly what we’re hearing in the startup communities. But in the context of ChatGPT, the data already existed online. It’s going to take a long time to compile that robotics data.
The kinds of abilities you see from leading robotics companies like Tesla and Figure are very impressive. For example, Figure has some interesting video demos where somebody is conversing with a robot while it performs tasks with its hands. But these robots still need to be trained to do specific tasks using machine learning approaches such as reinforcement learning — whereby you give the robot a task and tell it whether it got it right after each attempt — or the now more popular imitation learning, where a human demonstrates a task that the robot then imitates.
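For readers unfamiliar with imitation learning, here is a minimal behaviour-cloning sketch in PyTorch: a small network is trained to reproduce the actions a human demonstrator took for each observation. The dimensions and the synthetic “demonstrations” are purely illustrative, not a description of what these companies actually do.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps an observation (e.g. joint angles + object pose) to an action."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

# Stand-in demonstration data: 1,000 (observation, action) pairs recorded
# while a human teleoperated the robot. Here they are random placeholders.
obs = torch.randn(1000, 10)
actions = torch.randn(1000, 7)

policy = Policy(obs_dim=10, act_dim=7)
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimiser.zero_grad()
    loss = loss_fn(policy(obs), actions)  # imitate the demonstrator's actions
    loss.backward()
    optimiser.step()
```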
These companies are likely collecting thousands or possibly millions of demonstrations to train the robots, which is a time-consuming and expensive process. There’s not a huge amount of scientific novelty there. It seems very unlikely that those robots will soon be able to perform any task you want from just a language command. And none of these companies are claiming that their robots can do this now. They’re saying it will happen in the future. I think it will be years, maybe decades, before that happens.
Wouldn’t the robots be able to gather the data they need themselves and combine it with the information they learn from LLMs?
I think that’s what some people are betting on. Can we let robots collect that data themselves — meaning we leave them in a room overnight with a task and some objects — and see what they have learned by morning? That’s the type of thinking used in reinforcement learning, and the community previously moved away from this approach after realising it was generating frustrating results that weren’t going anywhere. But we could see it swing back in the context of these vision-language models.
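The “leave the robot alone overnight” idea amounts to autonomous experience collection. Below is a toy illustration using the standard Gymnasium API, with CartPole standing in for a real robotic task and random actions standing in for an exploration policy; the point is the data-collection loop, not the deliberately naive behaviour.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")   # placeholder for a real robotic task
experience = []                 # (obs, action, reward, next_obs) tuples

obs, info = env.reset()
for step in range(10_000):      # "overnight" data collection, in miniature
    action = env.action_space.sample()  # naive random exploration
    next_obs, reward, terminated, truncated, info = env.step(action)
    experience.append((obs, action, reward, next_obs))
    obs = next_obs
    if terminated or truncated:
        obs, info = env.reset()
env.close()

# `experience` would then feed a reinforcement-learning update. In practice,
# purely random exploration rarely produces useful behaviour, which is the
# frustration described above.
```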
There is still scope for scientific discovery in robotics, and a lot of work left to do. For instance, I work on getting robots to learn a task within a few minutes and from a non-expert teacher.
Do you think LLMs and vision-language models in robotics will just be a flash in the pan?
I don’t think so. It’s true that these new approaches have so far had only a minor impact in robotics compared with older methods. However, while classical engineering has reached something of a saturation point, vision-language models will improve over time.
Casting our minds to the future, for instance, we could see generative AI models produce a video predicting the consequences of a robot’s actions. If we can get to that point, then the robot can start to reason in video and in action — there’s a lot of potential there for robotics.