One to rule them all: natural language to bind communication, perception and action

Abstract

In recent years, research in human-robot interaction has focused on developing robots capable of understanding complex human instructions and performing tasks in dynamic and diverse environments. These systems have a wide range of applications, from personal assistance to industrial robotics, which underlines the importance of robots interacting flexibly, naturally, and safely with humans. This paper presents an advanced architecture for robotic action planning that integrates communication, perception, and planning with Large Language Models (LLMs). Our system is designed to translate commands expressed in natural language into executable robot actions, incorporating environmental information and dynamically updating plans based on real-time feedback. The Planner Module is the core of the system, where LLMs embedded in a modified ReAct framework are employed to interpret and carry out user commands. By leveraging their extensive pre-trained knowledge, LLMs can effectively process user requests without the need to introduce new knowledge about the changing environment. The modified ReAct framework further enhances the execution space by providing real-time environmental perception and the outcomes of physical actions. By combining robust and dynamic semantic map representations as graphs with control components and failure explanations, this architecture enhances a robot's adaptability, task execution, and seamless collaboration with human users in shared and dynamic environments. Through the integration of continuous feedback loops with the environment, the system can dynamically adjust the plan to accommodate unexpected changes, optimizing the robot's ability to perform tasks. Using a dataset of previous experiences, the system can provide detailed feedback about a failure and update the LLM's context for the next iteration with suggestions on how to overcome the issue.
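As an illustration only, the plan-act-observe loop with failure feedback described in the abstract can be sketched in Python. This is a minimal sketch under stated assumptions, not the system's implementation: `SemanticMap`, `execute`, `stub_planner`, and `react_loop` are hypothetical names, the executor is simulated, and a rule-based stub stands in for the LLM.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticMap:
    """Toy semantic map (assumption): a graph stored as (node, relation) -> node."""
    edges: dict = field(default_factory=dict)

    def locate(self, obj):
        # Follow the "in" edge from an object to the place that contains it.
        return self.edges.get((obj, "in"))


def execute(action, arg, world, robot):
    """Simulated executor: returns (success, observation text)."""
    if action == "goto":
        robot["at"] = arg
        return True, f"robot is now in {arg}"
    if action == "pick":
        where = world.locate(arg)
        if where != robot["at"]:
            # Failure explanation that is fed back into the planner context.
            return False, f"{arg} is not in {robot['at']}; map says it is in {where}"
        robot["holding"] = arg
        return True, f"holding {arg}"
    return False, f"unknown action {action}"


def stub_planner(goal, context):
    """Stand-in for the LLM: chooses the next action from the latest feedback."""
    last = context[-1] if context else ""
    if last.startswith("FAILURE"):
        # Recover by moving to the place named at the end of the explanation.
        return "goto", last.rsplit(" ", 1)[-1]
    return "pick", goal


def react_loop(goal, world, robot, max_steps=5):
    """Plan, act, observe; append each outcome to the context for the next step."""
    context = []
    for _ in range(max_steps):
        action, arg = stub_planner(goal, context)
        ok, obs = execute(action, arg, world, robot)
        context.append(("OBSERVATION: " if ok else "FAILURE: ") + obs)
        if robot.get("holding") == goal:
            break
    return context


if __name__ == "__main__":
    world = SemanticMap({("cup", "in"): "kitchen"})
    robot = {"at": "hall", "holding": None}
    for line in react_loop("cup", world, robot):
        print(line)
```

In this toy run the first pick attempt fails because the robot is in the wrong room; the failure text is appended to the context, and the next iteration uses it to move before retrying. In a real system the stub would be replaced by an LLM call and the executor by robot control and perception.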
