Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation
December 10, 2025
Authors: Yuyang Li, Yinghan Chen, Zihang Zhao, Puhao Li, Tengyu Liu, Siyuan Huang, Yixin Zhu
cs.AI
Abstract
Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
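To make the policy-learning side of the abstract more concrete, here is a minimal sketch of how fused visual and tactile embeddings could condition a Transformer-based diffusion policy. This is not the authors' TacThru-UMI implementation; the module names, feature dimensions, action horizon, and token layout below are assumptions chosen for illustration only.

# Illustrative sketch only (assumed shapes and names, not the paper's code):
# a Transformer denoiser that predicts noise on an action sequence,
# conditioned on one visual and one tactile embedding plus a diffusion timestep.
import torch
import torch.nn as nn

class MultimodalDiffusionPolicy(nn.Module):
    def __init__(self, vis_dim=512, tac_dim=512, act_dim=7, horizon=16, d_model=256):
        super().__init__()
        # Project each modality and the noisy actions into a shared token space.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.tac_proj = nn.Linear(tac_dim, d_model)
        self.act_proj = nn.Linear(act_dim, d_model)
        self.time_emb = nn.Embedding(1000, d_model)  # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, act_dim)
        self.horizon = horizon

    def forward(self, vis_feat, tac_feat, noisy_actions, t):
        # vis_feat: (B, vis_dim), tac_feat: (B, tac_dim)
        # noisy_actions: (B, horizon, act_dim), t: (B,) integer diffusion timesteps
        cond = torch.stack(
            [self.vis_proj(vis_feat), self.tac_proj(tac_feat), self.time_emb(t)], dim=1
        )  # (B, 3, d_model) conditioning tokens
        act_tokens = self.act_proj(noisy_actions)          # (B, horizon, d_model)
        tokens = torch.cat([cond, act_tokens], dim=1)      # prepend conditioning tokens
        out = self.encoder(tokens)[:, -self.horizon:]      # keep only action positions
        return self.head(out)                              # predicted noise per action step

At inference time, a model like this would sit inside a standard DDPM/DDIM-style denoising loop that iteratively refines a noisy action sequence given the current camera and TacThru tactile observations; the actual conditioning scheme and encoders used in TacThru-UMI may differ.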