Lightweight Visual Reasoning for Socially-Aware Robots

Abstract

Robots operating in shared human environments must not only navigate, interact with, and perceive their surroundings; they must also interpret and respond to dynamic, often unpredictable human behaviours. Although recent advances have shown promise in enhancing robotic perception and instruction-following using Vision-Language Models (VLMs), these models remain limited in addressing the complexities of multimodal human-robot interaction (HRI). Motivated by this challenge, we introduce a lightweight language-to-vision feedback module that closes the loop between the LLM and the vision encoder in a VLM. The module projects image-token hidden states through a gated Multi-Layer Perceptron (MLP) back into the encoder input, prompting a second pass that reinterprets the scene under the text context. We evaluate this approach on three robotics-centred tasks: navigation in a simulated environment (Habitat), sequential scene description (Mementos-Robotics), and human-intention recognition (our HRI dataset). Results show that our method improves Qwen 2.5 (7B) by 3.3% on navigation (less distance), +0.057 in description score, and +2.93% in accuracy, with less than 3% extra parameters. Gemma 3 (4B) and LLaVA OV 1.5 (4B) show mixed navigation results but gain +0.111 and +0.055 in description score and +10.81% and +4.79% in accuracy, respectively, on the latter two tasks. Code is available at https://github.com/alessioGalatolo/VLM-Reasoning-for-Robotics
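The feedback path described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `GatedFeedbackMLP`, the helper `second_pass_input`, the sigmoid/tanh gating form, the additive combination with the encoder input, and all dimensions are assumptions chosen only to make the idea concrete. The released repository is the reference for the actual design.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GatedFeedbackMLP:
    """Hypothetical gated MLP: maps LLM hidden states of image tokens
    (width d_llm) into the vision encoder's input space (width d_enc)."""

    def __init__(self, d_llm, d_hidden, d_enc, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights; a trained module would learn these.
        self.w_up = rng.normal(0.0, 0.02, (d_llm, d_hidden))
        self.w_gate = rng.normal(0.0, 0.02, (d_llm, d_hidden))
        self.w_down = rng.normal(0.0, 0.02, (d_hidden, d_enc))

    def __call__(self, h):
        # The gate decides, per feature, how much of the signal passes.
        gate = sigmoid(h @ self.w_gate)
        up = np.tanh(h @ self.w_up)
        return (gate * up) @ self.w_down

    def num_params(self):
        return self.w_up.size + self.w_gate.size + self.w_down.size


def second_pass_input(encoder_input, image_hidden_states, feedback):
    """Assumed additive feedback: the projected hidden states are added
    to the original encoder input before the encoder runs a second time."""
    return encoder_input + feedback(image_hidden_states)


# Toy usage: 5 image tokens, LLM width 8, encoder input width 6.
feedback = GatedFeedbackMLP(d_llm=8, d_hidden=4, d_enc=6)
x = np.ones((5, 6))   # stand-in for first-pass encoder input
h = np.ones((5, 8))   # stand-in for image-token hidden states from the LLM
x_second = second_pass_input(x, h, feedback)
```

In this sketch the second encoder pass would consume `x_second` instead of `x`, and `num_params()` gives the size of the added module for comparison against the base model's parameter count.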
