Lightweight Visual Reasoning for Socially-Aware Robots

Abstract

Robots operating in shared human environments must not only navigate, interact with, and sense their surroundings but also interpret and respond to dynamic, often unpredictable human behaviour. Recent work has used Vision-Language Models (VLMs) to improve robotic perception and instruction following, yet these models remain limited in handling the complexity of multimodal human-robot interaction (HRI). Motivated by this challenge, we introduce a lightweight language-to-vision feedback module that closes the loop between the LLM and the vision encoder of a VLM. The module projects image-token hidden states through a gated Multi-Layer Perceptron (MLP) back into the encoder input, prompting a second pass that reinterprets the scene in light of the text context. We evaluate this approach on three robotics-centred tasks: navigation in a simulated environment (Habitat), sequential scene description (Mementos-Robotics), and human-intention recognition (our own HRI dataset). With fewer than 3% additional parameters, our method improves Qwen 2.5 (7B) on all three tasks: 3.3% less distance travelled in navigation, +0.057 description score, and +2.93% intention-recognition accuracy. Gemma 3 (4B) and LLaVA OV 1.5 (4B) show mixed navigation results but gain +0.111 and +0.055 in description score and +10.81% and +4.79% in accuracy, respectively. Code is available at https://github.com/alessioGalatolo/VLM-Reasoning-for-Robotics
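
The feedback step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). The class name GatedFeedback, the two-layer MLP with a tanh activation, the scalar tanh gate initialised to zero, the additive update of the encoder input, and all dimensions are assumptions chosen for illustration only.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class GatedFeedback:
    """Hypothetical gated MLP mapping LLM image-token states to encoder inputs."""

    def __init__(self, d_llm, d_hidden, d_enc, seed=0):
        rng = random.Random(seed)
        self.w_up = init_matrix(d_hidden, d_llm, rng)    # d_llm -> d_hidden
        self.w_down = init_matrix(d_enc, d_hidden, rng)  # d_hidden -> d_enc
        # Assumption: a scalar gate passed through tanh, so gate = 0 makes
        # the module a no-op and the base VLM behaviour is unchanged.
        self.gate = 0.0

    def project(self, hidden_state):
        """Two-layer MLP applied to one image-token hidden state."""
        activated = [math.tanh(a) for a in matvec(self.w_up, hidden_state)]
        return matvec(self.w_down, activated)

    def __call__(self, encoder_inputs, image_hidden_states):
        """Return the encoder inputs for the second pass, one row per token."""
        g = math.tanh(self.gate)
        updated = []
        for x, h in zip(encoder_inputs, image_hidden_states):
            delta = self.project(h)
            updated.append([xi + g * di for xi, di in zip(x, delta)])
        return updated


# Toy example: 3 image tokens, encoder width 4, LLM width 6.
rng = random.Random(1)
encoder_inputs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
hidden_states = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(3)]

feedback = GatedFeedback(d_llm=6, d_hidden=8, d_enc=4)
closed_gate = feedback(encoder_inputs, hidden_states)  # gate = 0: unchanged
feedback.gate = 0.5
open_gate = feedback(encoder_inputs, hidden_states)    # text-conditioned inputs
```

In a full model, the returned rows would be fed to the vision encoder again, and the resulting image tokens would replace the first-pass tokens in the LLM context. Starting the gate at zero is one common way to add a module to a pretrained model without disturbing it at the start of training; whether the paper uses this exact gating form is not stated in the abstract.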
