

Lightweight Visual Reasoning for Socially-Aware Robots

March 4, 2026
作者: Alessio Galatolo, Ronald Cumbal, Alexandros Rouchitsas, Katie Winkle, Didem Gürdür Broo, Ginevra Castellano
cs.AI

Abstract

Robots operating in shared human environments must not only navigate, interact, and perceive their surroundings; they must also interpret and respond to dynamic, and often unpredictable, human behaviours. Although recent advances have shown promise in enhancing robotic perception and instruction-following using Vision-Language Models (VLMs), these models remain limited in addressing the complexities of multimodal human-robot interaction (HRI). Motivated by this challenge, we introduce a lightweight language-to-vision feedback module that closes the loop between the LLM and the vision encoder in VLMs. The module projects image-token hidden states through a gated Multi-Layer Perceptron (MLP) back into the encoder input, prompting a second pass that reinterprets the scene under text context. We evaluate this approach on three robotics-centred tasks: navigation in a simulated environment (Habitat), sequential scene description (Mementos-Robotics), and human-intention recognition (our HRI dataset). Results show that our method improves Qwen 2.5 (7B) by 3.3% shorter navigation distance, +0.057 description score, and +2.93% accuracy, with less than 3% extra parameters; Gemma 3 (4B) and LLaVA OV 1.5 (4B) show mixed navigation results but gain +0.111/+0.055 description score and +10.81%/+4.79% accuracy on the latter two tasks. Code is available at https://github.com/alessioGalatolo/VLM-Reasoning-for-Robotics
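The abstract's core mechanism, a gated MLP that projects image-token hidden states from the LLM back into the vision encoder's input space for a second pass, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the class name `GatedFeedback`, the dimensions, the tanh-based gate, and the zero initialisation (so the second pass starts as an identity over the original patch embeddings) are all assumptions; see the linked repository for the actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

class GatedFeedback:
    """Hypothetical sketch of a language-to-vision feedback module:
    project LLM image-token hidden states through a gated MLP back
    into the vision encoder's input (patch-embedding) space."""

    def __init__(self, llm_dim, enc_dim, hidden_dim):
        # Small two-layer MLP mapping LLM space -> encoder space.
        self.w1 = rng.standard_normal((llm_dim, hidden_dim)) * 0.02
        self.w2 = rng.standard_normal((hidden_dim, enc_dim)) * 0.02
        # Per-channel gate, zero-initialised so tanh(gate) = 0 and the
        # feedback contributes nothing before training.
        self.gate = np.zeros(enc_dim)

    def __call__(self, img_hidden, enc_input):
        # img_hidden: (num_patches, llm_dim)  hidden states of image tokens
        # enc_input:  (num_patches, enc_dim)  original patch embeddings
        h = np.tanh(img_hidden @ self.w1) @ self.w2   # projected feedback
        g = np.tanh(self.gate)                        # learned gate in (-1, 1)
        return enc_input + g * h                      # input for second pass

# Toy dimensions (illustrative, not the real models' sizes).
fb = GatedFeedback(llm_dim=3584, enc_dim=1152, hidden_dim=256)
patches = rng.standard_normal((196, 1152))
hidden = rng.standard_normal((196, 3584))
second_pass_input = fb(hidden, patches)
print(second_pass_input.shape)  # (196, 1152)
```

Because only `w1`, `w2`, and `gate` are trainable while the VLM backbone stays frozen, a module of this shape adds only a few percent extra parameters, consistent with the abstract's "less than 3%" claim.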
PDF · March 9, 2026