Google Unveils LM-Nav, A Robotic Navigation System, In Association With Universities

LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action (Summary)
One of the biggest challenges in robotics is enabling robots to understand human commands in real time, react immediately to new commands and changes in the environment, and plan and carry out new tasks on the fly to meet human requirements.
For example, to navigate to a destination according to human instructions, a robot must not only understand those instructions (natural language understanding) but also perceive its surroundings in real time (visual recognition), and it must be able to "translate" between the verbal instructions and the environment it perceives in order to reach the destination as directed.
The main solution to this type of task has previously been to train robots on large numbers of similar tasks annotated with textual instructions. However, this approach requires annotated data, which is costly to collect and ultimately limits the range of applications in which robots can be used.
Recently, a growing body of research has shown the feasibility of a different approach: training robots to learn vision-based navigation from large, unlabelled datasets using self-supervised, goal-conditioned training objectives. This approach offers better scalability and robustness.
Inspired by this idea, Google researchers have developed LM-Nav, a navigation system built from large pre-trained models. It combines the advantages of the two approaches described above, using the capabilities of pre-trained models so that the robot can understand natural language commands and fulfil the task requirements even though none of the navigation data carries user annotations.
Of particular importance is the strong generalisation ability of the pre-trained language and vision-language models in the system, which allows the robot to understand and execute more complex, high-level instructions.
The paper, "LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action", was recently published on the arXiv. arXiv, with the University of California, Berkeley, and the University of Warsaw, Poland, participating in the research.
The LM-Nav navigation system consists of three large pre-trained models: one for language processing, one for associating images with language, and one for visual navigation. They are as follows.
First, a large language model (LLM) handles natural language understanding: trained on a large web text corpus, it parses the user's textual instructions into a sequence of landmarks. The LLM chosen for LM-Nav is the well-known GPT-3 model.
Second, a vision-and-language model (VLM) correlates information expressed in images and text. In the navigation task, the VLM associates the landmarks in the user's instructions with the robot's visually observed surroundings. The VLM chosen for this system is the CLIP model from the US artificial intelligence research company OpenAI (an illustrative sketch of this scoring step follows the three model descriptions below).
Third, a visual navigation model (VNM) navigates directly from the robot's visual observations, relating images to the actions that follow them in time. As its VNM, LM-Nav uses ViNG, a goal-conditioned visual navigation model developed by researchers at UC Berkeley.
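To make the VLM's role more concrete, the sketch below shows how a CLIP model can score candidate landmark phrases against a handful of robot camera images. This is an illustrative example only, not the authors' code: the checkpoint name is a standard public CLIP model, and the landmark phrases and image file names are invented placeholders.

```python
# Minimal sketch: scoring landmark phrases against camera observations with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Landmark phrases an LLM might extract from an instruction (example values).
landmarks = ["a stop sign", "a blue dumpster", "a white building"]

# Observations from the robot's camera (placeholder file names).
images = [Image.open(p) for p in ["obs_001.jpg", "obs_002.jpg", "obs_003.jpg"]]

inputs = processor(text=landmarks, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image[i, j] is the similarity of observation i and landmark j.
# A softmax over the image axis gives P(observation | landmark), which a planner
# can use to decide where each landmark is most likely located.
probs = outputs.logits_per_image.softmax(dim=0)
for j, phrase in enumerate(landmarks):
    i = probs[:, j].argmax().item()
    print(f"'{phrase}' best matches observation {i} (p={probs[i, j].item():.2f})")
```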
Figure|LM-Nav navigation system (Source: arXiv)
In brief, the main working process of the LM-Nav navigation system is shown in the diagram below.
Figure|Main working process of LM-Nav navigation system (Source: arXiv)
First, the system takes as input the robot's observations of the target environment together with the textual instructions given by the user, and uses the three pre-trained models to derive an execution plan: the LLM extracts landmarks from the instructions, the VLM associates those textual landmarks with images, and the VNM carries out the navigation. With these components, LM-Nav requires no fine-tuning even in complex environments and executes varied user commands based entirely on what it observes visually in real time.
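The planning step can be illustrated with a small, self-contained example. The sketch below is not the authors' implementation: it assumes a toy graph with made-up traversability and landmark-match log-probabilities, and shows how a simple search can pick a route that visits the extracted landmarks in order while favouring edges the navigation model considers traversable, which is the spirit of how LM-Nav combines the three models' outputs.

```python
# Illustrative only: combining VLM landmark scores with VNM edge traversability
# to plan a route that visits landmarks in order. All nodes, edges, and
# probabilities below are invented for the example.
import math

# edges[u][v]: log-probability that the VNM can drive from u to v (assumed values).
edges = {
    "A": {"B": math.log(0.9), "C": math.log(0.5)},
    "B": {"C": math.log(0.8), "D": math.log(0.6)},
    "C": {"D": math.log(0.9)},
    "D": {},
}

# Landmarks in the order the LLM extracted them from the instruction.
landmarks = ["stop sign", "blue dumpster"]

# match[n][k]: log P(landmark k is visible at node n), e.g. from CLIP (assumed values).
match = {
    "A": [math.log(0.10), math.log(0.05)],
    "B": [math.log(0.70), math.log(0.10)],
    "C": [math.log(0.20), math.log(0.60)],
    "D": [math.log(0.10), math.log(0.80)],
}

NEG_INF = float("-inf")

def best_landmark_score(path):
    """Best log-score for matching all landmarks, in order, to nodes along `path`."""
    dp = [NEG_INF] * (len(landmarks) + 1)
    dp[0] = 0.0
    for node in path:
        # Iterate k downward so each node matches at most one landmark.
        for k in range(len(landmarks), 0, -1):
            dp[k] = max(dp[k], dp[k - 1] + match[node][k - 1])
    return dp[-1]

def plan(start):
    """Enumerate simple paths from `start` and keep the highest-scoring one."""
    best_score, best_path = NEG_INF, None
    stack = [(start, [start], 0.0)]
    while stack:
        node, path, edge_score = stack.pop()
        score = edge_score + best_landmark_score(path)
        if score > best_score:
            best_score, best_path = score, path
        for nxt, log_p in edges[node].items():
            if nxt not in path:
                stack.append((nxt, path + [nxt], edge_score + log_p))
    return best_path, best_score

route, score = plan("A")
print("planned route:", " -> ".join(route), f"(log-score {score:.2f})")
```

In the real system, the graph is built from the robot's own prior traversals of the environment, and the VNM then converts the planned sequence of nodes into low-level actions.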
To evaluate the system, the researchers deployed LM-Nav on the Clearpath Jackal UGV, a robotics research platform. Its sensor suite includes a 6-degree-of-freedom IMU, a GPS unit for approximate positioning, wheel encoders, and front- and rear-facing RGB cameras with a 170° field of view for capturing visual observations.
The experiments included 20 navigation tests in environments of varying difficulty, with the robot covering a total of over 6 km.
Figure|Application of the LM-Nav system, which requires a robot to perform tasks in a real environment according to user instructions (source: arXiv)
As shown above, the underlined parts of the text on the left are the landmarks extracted by the LLM; the markers in the overhead view in the middle are the result of language-image grounding by the VLM; and on the right is the actual navigation executed by the VNM.
Figure|Performance results of LM-Nav system compared to GPS-Nav system without VNM (Source: arXiv)
The researchers also introduced performance metrics such as planning success, efficiency, and the average number of manual interventions to compare LM-Nav with the GPS-Nav navigation system. The results show that LM-Nav outperforms GPS-Nav on all of these measures.
Reference:
https://github.com/blazejosinski/lm_nav