
How AI could humanise robots

Edward Johns
Director of the Robot Learning Lab at Imperial College London
Key takeaways
  • Large language models (LLMs) and vision-language models will have a major impact on the future of robotics.
  • Robots can now communicate in natural language, break tasks down into steps and reason using images.
  • However, LLMs do not effectively enable robots to manipulate their environment with their hands or interact with a 3D environment.
  • There is potential for developing robotics using generative AI, such as enabling robots to reason in video and in action.

Watching videos released by robotics companies like Tesla and Figure, it could seem like robots will walk into our homes tomorrow, able to execute any command a human gives them thanks to advancements in large language models (LLMs). That may be coming down the pike, but there are some substantial hurdles to overcome first, says Edward Johns, director of the Robot Learning Lab at Imperial College London.

We have seen stratospheric advances in the field of large language models. Is that going to push robotics forward?

Edward Johns. What has happened with large neural networks like language models and vision-language models will have a big impact on robotics; it's already helping with some of the challenges we've had. But we're certainly not going to see a ChatGPT-like moment in robotics overnight.

LLMs enable operators to use natural language when communicating with the robot, rather than inputting code. That's useful because, ultimately, that's how we want humans to interact with them. More importantly, these models can unlock a new way of reasoning for robots. ChatGPT, for instance, can break tasks down into steps. If you ask it how to make a sandwich, it will say: you need bread, you need to buy bread, you need to find a shop, get your wallet, leave the house, and so on. That means robots can learn to break down tasks internally, and we know they perform better when they have a step-by-step guide.
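To make that decomposition step concrete, here is a minimal Python sketch. The function query_llm is a hypothetical stand-in for whichever language-model API a robot system actually calls, and the canned sandwich steps are invented purely for illustration.

# Minimal sketch of LLM-based task decomposition for a robot planner.
# query_llm is hypothetical; a real system would call a hosted language model.

def query_llm(prompt: str) -> str:
    # Placeholder reply standing in for a real model's numbered plan.
    return "1. locate bread\n2. pick up a knife\n3. spread the filling\n4. assemble the sandwich"

def decompose_task(task: str) -> list[str]:
    """Ask the model to break a high-level command into short, ordered sub-steps."""
    prompt = (
        f"Break the task '{task}' into short, numbered steps "
        "that a robot with a single arm could follow."
    )
    reply = query_llm(prompt)
    # Keep the text after the leading number on each line.
    return [line.split(".", 1)[1].strip() for line in reply.splitlines() if "." in line]

for step in decompose_task("make a sandwich"):
    print(step)

Each returned step would then be handed to a lower-level controller; as the rest of the interview makes clear, executing those steps physically is where the real difficulty still lies.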

Over the past few months, we've also seen the emergence of so-called "vision-language models" that allow the robot to reason not only in language but with images. That's important because, at some point, the robots need to add visual information to their reasoning to navigate their environment.

What, then, is the limit to using LLMs for robots?

While these are interesting models to probe, they are solving some of the easier challenges in robotics. They have not had a huge impact in terms of dexterous manipulation, for instance: manipulation with hands. That's really what robotics is still missing, and it is really difficult. Our hands do thousands and thousands of complex tasks every day.

One problem is that these vision-language models are very good semantically, but they won't be able to help the robot interact with a 3D environment, because they are trained on 2D images. For robots to be able to reason at that level, they need a huge amount of robotics data, which just doesn't exist. Some people think this will happen very quickly, like the flashpoint we have had since the emergence of ChatGPT; that's certainly what we're hearing in the startup communities. But in the context of ChatGPT, the data already existed online. It's going to take a long time to compile that robotics data.

The kind of abilities that you see from leading robotics companies like Tesla and Figure are very impressive. For example, Figure has some interesting video demos where somebody is conversing with a robot performing tasks with its hands. But these robots still need to be trained to do specific tasks using machine learning approaches such as reinforcement learning, whereby you tell the robot to do a task and then tell it whether it got it right after a few tries, or the now more popular imitation learning, where a human demonstrates a task that the robot needs to imitate.
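As a rough illustration of the imitation-learning side, here is a minimal behaviour-cloning sketch in PyTorch, one common way of imitating demonstrations. The observation and action dimensions, the random stand-in demonstrations, and the network size are illustrative assumptions, not any company's actual setup.

# Behaviour cloning: learn to map observations to actions from recorded demonstrations.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7                      # assumed sizes: joint states in, arm commands out
policy = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, act_dim),
)
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-in for (observation, action) pairs recorded from a human teacher.
demo_obs = torch.randn(1000, obs_dim)
demo_act = torch.randn(1000, act_dim)

for epoch in range(10):
    pred = policy(demo_obs)                           # actions the policy would take
    loss = nn.functional.mse_loss(pred, demo_act)     # match the demonstrated actions
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

Collecting enough real demonstrations to fill demo_obs and demo_act is precisely the time-consuming, expensive step described below.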

These companies are likely collecting thousands or possibly millions of demonstrations to train the robots, which is a time-consuming and expensive process. There's not a huge amount of scientific novelty there. It seems very unlikely that those robots will soon be able to perform any task you want from just a language command. And none of these companies are claiming that their robots can do this now. They're saying it will happen in the future. I think it will be years, maybe decades, before that happens.

Wouldn’t the robots be able to gather the data they need and compile it with the information they learn from LLMs? 

I think that’s what some people are betting on. Can we let robots collect that data themselves, meaning we leave them in a room overnight with a task and objects, and see what they have learned by morning? That’s the type of thinking used in reinforcement learning, and the community had previously moved away from this approach after realising it was generating some frustrating results that weren’t going anywhere. But we could see it swing back in the context of these vision-language models.
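A toy sketch of that overnight idea, assuming a simulated robot interface (the RobotEnv class below is hypothetical): the robot attempts the task over and over, scores its own attempts, and keeps the successful trials as training data.

# Self-collected data: repeated autonomous trials with a success signal.
import random

class RobotEnv:
    """Hypothetical environment: reset the scene, apply an action, report success."""
    def reset(self):
        return [random.random() for _ in range(4)]          # toy observation
    def step(self, action):
        next_obs = [random.random() for _ in range(4)]
        reward = 1.0 if random.random() < 0.1 else 0.0      # rare, noisy success signal
        return next_obs, reward

env = RobotEnv()
collected = []                                              # (observation, action, reward) triples
for trial in range(1000):                                   # "overnight" attempts
    obs = env.reset()
    action = [random.uniform(-1, 1) for _ in range(7)]      # exploratory action
    _, reward = env.step(action)
    if reward > 0:                                          # keep only successful attempts
        collected.append((obs, action, reward))

print(f"kept {len(collected)} successful trials for training")

The frustration mentioned above typically shows up here: when the success signal is rare, most of those trials teach the robot very little.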

There is still scope for scientific discovery in robotics. I think there's still a lot of work to do. For instance, I work on trying to get robots to learn a task within a few minutes and with a non-expert teacher.

Do you think LLMs and vision-language models in robotics will just be a flash in the pan?

I don’t think so. It’s true that these new approaches have so far had only a minor impact in robotics compared to older methods. However, while classical engineering has reached something of a saturation point, vision-language models will improve over time.

Casting our minds to the future, we could, for instance, see generative AI models produce a video predicting the consequences of a robot's actions. If we can get to that point, then the robot can start to reason in video and in action; there's a lot of potential there for robotics.
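One conceptual sketch of what reasoning in video and action could look like, with an untrained placeholder network standing in for a large generative video model: the robot imagines the visual consequence of each candidate action and picks the action whose predicted outcome is closest to a goal image. Every dimension and name here is an assumption made for illustration.

# Action selection by imagining outcomes with a learned (here: placeholder) predictor.
import torch
import torch.nn as nn

frame_dim, act_dim = 64, 7                      # flattened image and action sizes (assumed)

predictor = nn.Sequential(                      # placeholder for an action-conditioned video model
    nn.Linear(frame_dim + act_dim, 256), nn.ReLU(),
    nn.Linear(256, frame_dim),
)

def choose_action(current_frame, goal_frame, candidates):
    """Pick the candidate action whose predicted next frame is closest to the goal."""
    best_action, best_error = None, float("inf")
    for action in candidates:
        predicted = predictor(torch.cat([current_frame, action]))   # imagined consequence
        error = torch.norm(predicted - goal_frame).item()
        if error < best_error:
            best_action, best_error = action, error
    return best_action

current = torch.randn(frame_dim)
goal = torch.randn(frame_dim)
candidates = [torch.randn(act_dim) for _ in range(16)]
print(choose_action(current, goal, candidates))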

Interview by Marianne Guenot
