Google Unveils LM-Nav, A Robotic Navigation System, In Association With Universities

Published: 03 August 2022 | Last Updated: 03 August 2022
From the paper: "We present Large Model Navigation (LM-Nav), a method that combines the strengths of large, pre-trained models of language, images, and visual navigation for the task of embodied instruction following."

LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action (Summary)

One of the biggest challenges in the field of robotics is to enable robots to understand human commands, react immediately to new commands and changes in the environment, plan new tasks in real-time, and fulfil human requirements.

 

For example, in the task of navigating to a destination according to human instructions, the robot must not only understand the instructions, i.e. have natural language understanding, and perceive its surroundings in real-time, i.e. have visual recognition; it must also be able to 'translate' the verbal instructions into the environment it perceives in order to reach the destination as instructed.


The main solution to this type of task has previously been to train robots to understand text by learning from a large number of similar tasks annotated with textual instructions. However, this approach requires annotated data, which can be costly and ultimately hinders the use of robots in a wider range of applications.

 

Recently, a growing body of research has shown the feasibility of a different approach: training robots to learn vision-based navigation from large, unlabelled datasets by means of self-supervised, goal-conditioned training. Moreover, this approach has better scalability and robustness.
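For a concrete picture, below is a minimal Python sketch of how self-supervised, goal-conditioned training data can be mined from unlabelled trajectories via hindsight relabelling; the function name and the horizon parameter are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch (illustrative): self-supervised goal-conditioned training data
# can be mined from unlabeled trajectories by "hindsight relabeling" -- any later
# observation in the same trajectory is treated as a goal, and the number of
# steps between the two frames is the supervision label.
import random

def sample_goal_conditioned_example(trajectory, max_horizon=20):
    """trajectory: list of (observation, action) pairs from an unlabeled run."""
    i = random.randrange(len(trajectory) - 1)
    j = random.randrange(i + 1, min(i + 1 + max_horizon, len(trajectory)))
    obs_i, action_i = trajectory[i]
    goal_obs, _ = trajectory[j]
    steps_to_goal = j - i  # temporal-distance label, no human annotation needed
    return obs_i, goal_obs, action_i, steps_to_goal

# A goal-conditioned policy/distance model is then trained to predict
# `action_i` and `steps_to_goal` from the pair (obs_i, goal_obs).
```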

 

Inspired by this idea, Google researchers have developed the LM-Nav system, a large model navigation system that combines the advantages of the two approaches described above. It uses the capabilities of pre-trained models to allow a robotic navigation system to understand natural language commands and fulfil task requirements through self-supervision, even though the navigation data is not annotated by any user.

 

Of particular importance is the powerful generalisation capability of the pre-trained language and vision-language models within the system, which allows the robot to understand and execute more complex, high-level instructions.

 

The paper, "LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action", was recently published on arXiv, with the University of California, Berkeley, and the University of Warsaw, Poland, participating in the research.

 

Figure (Source: arXiv)


The LM-Nav navigation system consists of three large pre-trained models for language processing, associating images with language, and visual navigation. These are as follows.

 

Firstly, a large language model (LLM) is used for the task of natural language understanding; trained on a large web-based text corpus, it parses user-given text commands into a sequence of landmarks. The LLM chosen for LM-Nav is the well-known GPT-3 model.
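As an illustration of this step, the sketch below shows how an instruction could be turned into an ordered landmark list with a prompted LLM. The prompt wording and the query_llm placeholder are assumptions for illustration, not the exact prompt or API call used by LM-Nav.

```python
# Illustrative sketch of the LLM step: turn a free-form instruction into an
# ordered list of landmarks. `query_llm` is a placeholder standing in for a
# GPT-3 API call; the actual prompt used by LM-Nav may differ.
import json

PROMPT_TEMPLATE = """Extract the landmarks mentioned in the navigation
instruction, in the order they should be visited, as a JSON list of strings.

Instruction: {instruction}
Landmarks:"""

def extract_landmarks(instruction: str, query_llm) -> list[str]:
    completion = query_llm(PROMPT_TEMPLATE.format(instruction=instruction))
    return json.loads(completion)  # e.g. ["stop sign", "picnic table", "blue truck"]

# Example: "Go straight past the stop sign, then head toward the picnic table
# and stop next to the blue truck." -> ["stop sign", "picnic table", "blue truck"]
```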

 

Secondly, a vision-and-language model (VLM) correlates the information expressed by images and text. In navigation tasks, the VLM correlates the landmarks in the user's commands with the robot's visually perceived surroundings. The VLM chosen for this system is the CLIP model from the US artificial intelligence research company OpenAI.
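A rough sketch of this image-text scoring with OpenAI's public CLIP package is shown below; the prompt template and the way the scores are arranged into a probability matrix are illustrative choices, not LM-Nav's exact code.

```python
# Hedged sketch of the VLM step using OpenAI's CLIP package: score how well each
# landmark phrase matches each image the robot has observed.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def landmark_image_probs(image_paths, landmarks):
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    texts = clip.tokenize([f"a photo of a {l}" for l in landmarks]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(images)
        text_feat = model.encode_text(texts)
        image_feat /= image_feat.norm(dim=-1, keepdim=True)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        # rows: images, columns: landmarks; softmax over images per landmark
        probs = (100.0 * image_feat @ text_feat.T).softmax(dim=0)
    return probs  # probs[i, j] ~ P(image i depicts landmark j)
```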

 

Thirdly, a visual navigation model (VNM) is used to navigate directly from the robot's visual observations, correlating images with the actions that follow them in time. The LM-Nav system uses the goal-conditioned ViNG model as its visual navigation model.
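The sketch below illustrates the role the VNM plays, using a hypothetical vnm callable that returns a temporal distance and a waypoint between two images; the interface, the VNMOutput type, and the reachability threshold are assumptions for illustration, not ViNG's actual API.

```python
# Hedged sketch of the VNM's role (interface is hypothetical): a goal-conditioned
# model estimates how many steps apart two images are and what waypoint moves the
# robot toward the goal image. Pairwise distances between logged observations are
# used to build a topological graph of the environment.
from dataclasses import dataclass
import networkx as nx

@dataclass
class VNMOutput:
    temporal_distance: float   # estimated number of steps between the two images
    waypoint: tuple            # relative waypoint (dx, dy) to drive toward the goal

def build_topological_graph(observations, vnm, reachable_threshold=20.0):
    """observations: list of images logged while driving around the environment."""
    graph = nx.DiGraph()
    for i, obs_i in enumerate(observations):
        for j, obs_j in enumerate(observations):
            if i == j:
                continue
            out: VNMOutput = vnm(obs_i, obs_j)      # hypothetical callable
            if out.temporal_distance < reachable_threshold:
                graph.add_edge(i, j, weight=out.temporal_distance)
    return graph
```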

 


Figure|LM-Nav navigation system (Source: arXiv)

 

In brief, the main working process of the LM-Nav navigation system is shown in the diagram below.



Figure|Main working process of LM-Nav navigation system (Source: arXiv)


First, the system takes as input observations of the target environment, as well as the textual instructions given by the user, and uses the three pre-trained models to derive an execution plan: the LLM extracts landmarks from the instructions, the VLM associates the textual landmarks with images, and the VNM performs the navigation. With these, even in complex environments, LM-Nav requires no fine-tuning and executes various user commands based entirely on visual information observed in real-time.
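To make the combination concrete, the simplified sketch below shows one way the three outputs could be fused into a plan: for each landmark in the LLM's sequence, pick a graph node that the VLM scores highly while keeping the VNM's travel cost low. The greedy search and the alpha trade-off parameter are simplifications for illustration, not the paper's exact search algorithm.

```python
# Illustrative, simplified planner: combine the VNM graph, the VLM landmark
# probabilities, and the LLM landmark sequence by choosing, landmark by landmark,
# a node that is likely to show the landmark and is cheap to reach.
import math
import networkx as nx

def plan_route(graph, probs, start_node, alpha=1.0):
    """graph: topological graph with 'weight' edge costs (from the VNM).
    probs[node][k]: probability that node's image shows landmark k (from the VLM).
    Returns one node to visit per landmark, chosen greedily (a simplification)."""
    num_landmarks = len(next(iter(probs.values())))
    lengths = dict(nx.all_pairs_dijkstra_path_length(graph, weight="weight"))
    current, route = start_node, []
    for k in range(num_landmarks):
        # score = landmark log-probability minus travel cost from the current node
        best = max(
            (n for n in graph.nodes if n in lengths[current]),
            key=lambda n: math.log(probs[n][k] + 1e-9) - alpha * lengths[current][n],
        )
        route.append(best)
        current = best
    return route  # the VNM then drives the robot node-to-node along this route
```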

 

To evaluate this system, the researchers deployed LM-Nav on the Clearpath Jackal UGV, a robotics research platform. The sensor suite on this platform includes a 6-DOF IMU, a GPS unit for approximate positioning, wheel encoders, and front- and rear-facing RGB cameras with a 170° field of view for capturing visual observations.

 

The experiments included 20 navigation tests of the system in environments of varying difficulty, with the robot covering a total of over 6 km.

 


Figure|Application of the LM-Nav system, which requires a robot to perform tasks in a real environment according to user instructions (Source: arXiv)


As shown above, the underlined parts of the text on the left are the landmarks extracted by the LLM; the landmarks marked in the overhead view in the middle are the result of language-image correlation by the VLM; and on the right is the actual navigation executed by the VNM.



Figure|Performance results of LM-Nav system compared to GPS-Nav system without VNM (Source: arXiv)


The researchers also introduced performance metrics such as planning success, efficiency, and the average number of manual interventions to compare LM-Nav's performance with that of the GPS-Nav navigation system. The results show that LM-Nav outperforms GPS-Nav in all aspects.

 

Reference:

https://github.com/blazejosinski/lm_nav

 

 
