Lightweight Visual Reasoning for Socially-Aware Robots
March 4, 2026
Authors: Alessio Galatolo, Ronald Cumbal, Alexandros Rouchitsas, Katie Winkle, Didem Gürdür Broo, Ginevra Castellano
cs.AI
Abstract
Robots operating in shared human environments must not only navigate, interact, and perceive their surroundings; they must also interpret and respond to dynamic, often unpredictable, human behaviours. Although recent advances have shown promise in enhancing robotic perception and instruction-following with Vision-Language Models (VLMs), these models remain limited in handling the complexities of multimodal human-robot interaction (HRI). Motivated by this challenge, we introduce a lightweight language-to-vision feedback module that closes the loop between the LLM and the vision encoder in a VLM. The module projects image-token hidden states through a gated Multi-Layer Perceptron (MLP) back into the encoder input, prompting a second pass that reinterprets the scene under text context. We evaluate this approach on three robotics-centred tasks: navigation in a simulated environment (Habitat), sequential scene description (Mementos-Robotics), and human-intention recognition (our HRI dataset). With less than 3% additional parameters, our method improves Qwen 2.5 (7B) on all three tasks: 3.3% shorter navigation distance, +0.057 description score, and +2.93% accuracy. Gemma 3 (4B) and LLaVA OV 1.5 (4B) show mixed navigation results but gain +0.111 and +0.055 in description score and +10.81% and +4.79% in accuracy, respectively. Code is available at https://github.com/alessioGalatolo/VLM-Reasoning-for-Robotics
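The abstract does not give the module's exact equations; purely as an illustrative sketch, a gated MLP that projects LLM image-token hidden states back into the vision-encoder input space might look like the following. All shapes, the tanh/sigmoid choices, and the function name are assumptions for illustration, not the authors' actual implementation:

```python
import numpy as np

def gated_feedback(hidden, W_proj, b_proj, W_gate, b_gate):
    """Hypothetical sketch of a language-to-vision feedback step.

    hidden: image-token hidden states from the LLM, shape (n_tokens, d_llm).
    Returns a correction of shape (n_tokens, d_vis) to be added to the
    vision-encoder input before a second pass. A near-zero gate leaves the
    original visual features effectively unchanged (a safe residual path).
    """
    proj = np.tanh(hidden @ W_proj + b_proj)                   # (n_tokens, d_vis)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ W_gate + b_gate)))   # per-token sigmoid gate, (n_tokens, 1)
    return gate * proj                                          # gated feedback signal

# Toy dimensions: 4 image tokens, LLM width 8, encoder input width 6.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))
fb = gated_feedback(
    h,
    rng.standard_normal((8, 6)) * 0.02, np.zeros(6),   # small init: feedback starts near zero
    rng.standard_normal((8, 1)) * 0.02, np.zeros(1),
)
```

In a full system, `fb` would be added to the original image features and the encoder run a second time, so the scene is re-read under the text context as the abstract describes.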