
Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation

December 10, 2025
Authors: Yuyang Li, Yinghan Chen, Zihang Zhao, Puhao Li, Tengyu Liu, Siyuan Huang, Yixin Zhu
cs.AI

Abstract

Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
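The learning component pairs simultaneous tactile-visual observations with a Transformer-based Diffusion Policy. As a rough illustration only, the sketch below shows how such a denoiser could fuse the two modalities; every module name, dimension, and the epsilon-prediction setup are our assumptions, not the paper's implementation (in practice the encoders would be image backbones over the STS sensor's tactile and visual streams).

```python
# Minimal sketch (assumed, not the authors' code) of a Transformer-based
# diffusion policy conditioned on simultaneous tactile and visual features.
import torch
import torch.nn as nn


class MultimodalDiffusionPolicy(nn.Module):
    def __init__(self, obs_dim=512, act_dim=7, horizon=16, n_layers=4, n_heads=8):
        super().__init__()
        # Placeholder per-modality encoders; real ones would be CNN/ViT
        # backbones over the sensor images.
        self.tactile_enc = nn.Linear(obs_dim, obs_dim)
        self.visual_enc = nn.Linear(obs_dim, obs_dim)
        self.act_embed = nn.Linear(act_dim, obs_dim)
        self.time_embed = nn.Embedding(1000, obs_dim)  # diffusion timestep token
        layer = nn.TransformerEncoderLayer(
            d_model=obs_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(obs_dim, act_dim)
        self.horizon = horizon

    def forward(self, tactile, visual, noisy_actions, t):
        # Condition tokens from both modalities, perceived at the same instant.
        cond = torch.stack(
            [self.tactile_enc(tactile), self.visual_enc(visual)], dim=1)
        act_tokens = self.act_embed(noisy_actions)      # noisy action sequence
        t_token = self.time_embed(t).unsqueeze(1)       # timestep embedding
        tokens = torch.cat([cond, t_token, act_tokens], dim=1)
        out = self.transformer(tokens)
        # Predict the noise added to each action step (epsilon-prediction).
        return self.head(out[:, -self.horizon:])


# Toy usage: one denoising step's noise prediction.
policy = MultimodalDiffusionPolicy()
tactile = torch.randn(2, 512)            # placeholder tactile features
visual = torch.randn(2, 512)             # placeholder visual features
noisy = torch.randn(2, 16, 7)            # noisy 16-step action trajectory
t = torch.randint(0, 1000, (2,))
eps_hat = policy(tactile, visual, noisy, t)
print(eps_hat.shape)                     # torch.Size([2, 16, 7])
```

At inference, a network like this would be applied iteratively over the diffusion schedule, denoising a sampled action trajectory into one the robot can execute.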