Learning Robot Manipulation from Audio World Models
December 9, 2025
Authors: Fan Zhang, Michael Gienger
cs.AI
Abstract
World models have demonstrated impressive performance on robotic learning tasks. Many such tasks inherently demand multimodal reasoning: for example, when filling a bottle with water, visual information alone is ambiguous or incomplete, requiring reasoning over the temporal evolution of audio and accounting for its underlying physical properties and pitch patterns. In this paper, we propose a generative latent flow matching model that anticipates future audio observations, enabling the system to reason about long-term consequences when integrated into a robot policy. Through two manipulation tasks that require perceiving in-the-wild audio or music signals, we demonstrate that our system outperforms methods without future lookahead. We further emphasize that successful robot action learning for these tasks relies not merely on multimodal input, but critically on accurate prediction of future audio states that embody intrinsic rhythmic patterns.
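The abstract does not spell out the training objective, but a latent flow matching model is typically trained by regressing a velocity field along a straight-line interpolation between noise and data. Below is a minimal NumPy sketch of that objective, standing in for predicting future audio latents; all names, dimensions, and the toy linear "model" are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 8  # assumed size of an audio latent (illustrative)

def flow_matching_loss(predict_velocity, z_future, z_noise, t):
    """Rectified-flow-style objective: interpolate z_t = (1-t)*noise + t*data
    and regress the model onto the constant target velocity (data - noise)."""
    z_t = (1.0 - t)[:, None] * z_noise + t[:, None] * z_future
    target_v = z_future - z_noise
    pred_v = predict_velocity(z_t, t)
    return float(np.mean((pred_v - target_v) ** 2))

# Toy stand-in for a neural velocity field: a fixed linear map over (z_t, t).
W = rng.normal(scale=0.1, size=(latent_dim + 1, latent_dim))
def predict_velocity(z_t, t):
    inp = np.concatenate([z_t, t[:, None]], axis=1)
    return inp @ W

batch = 16
z_future = rng.normal(size=(batch, latent_dim))  # "future audio latents" (data)
z_noise = rng.normal(size=(batch, latent_dim))   # source noise samples
t = rng.uniform(size=batch)                      # interpolation times in [0, 1]

loss = flow_matching_loss(predict_velocity, z_future, z_noise, t)
```

At inference time, future latents would be generated by integrating the learned velocity field from noise (e.g., with a few Euler steps), conditioned on past audio context, giving the policy a lookahead signal.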