
Learning Robot Manipulation from Audio World Models

December 9, 2025
Authors: Fan Zhang, Michael Gienger
cs.AI

Abstract

World models have demonstrated impressive performance on robotic learning tasks. Many such tasks inherently demand multimodal reasoning: when filling a bottle with water, for example, visual information alone is ambiguous or incomplete, so the system must reason over the temporal evolution of the audio, accounting for its underlying physical properties and pitch patterns. In this paper, we propose a generative latent flow matching model that anticipates future audio observations, enabling the system to reason about long-term consequences when integrated into a robot policy. Through two manipulation tasks that require perceiving in-the-wild audio or music signals, we demonstrate that our system outperforms methods without future lookahead. We further emphasize that successful robot action learning for these tasks relies not merely on multimodal input, but critically on accurate prediction of future audio states that embody intrinsic rhythmic patterns.
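To make the core idea concrete, the sketch below illustrates the flow matching objective and an Euler sampler in the general form used for latent generative models: a velocity field is regressed onto the straight-line path from noise to data, then integrated at inference time to draw future latents. This is a minimal toy, not the paper's method — the latent dimension, the linear stand-in for the learned velocity network, and the absence of past-observation conditioning are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 8  # toy audio-latent size; the paper does not specify dimensions

# Stand-in velocity model v(x, t): a linear map over [x, t]. The paper's model
# is a learned generative network (conditioned on past observations); this toy
# replaces it only to illustrate the objective and the sampler.
W = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM + 1))

def velocity(x, t):
    """Predict velocities for latents x of shape (N, D) at times t of shape (N, 1)."""
    return np.concatenate([x, t], axis=1) @ W.T

def flow_matching_loss(x1):
    """Conditional flow matching: draw noise x0 and a time t, form the linear
    interpolant x_t = (1 - t) x0 + t x1, and regress the predicted velocity
    onto the straight-line target x1 - x0."""
    x0 = rng.normal(size=x1.shape)
    t = rng.uniform(size=(x1.shape[0], 1))
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0
    return np.mean((velocity(x_t, t) - target) ** 2)

def sample_future_latent(n, steps=20):
    """Integrate dx/dt = v(x, t) from noise at t=0 to a latent sample at t=1
    with fixed-step Euler updates."""
    x = rng.normal(size=(n, LATENT_DIM))
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((n, 1), i * dt)
        x = x + dt * velocity(x, t)
    return x

# Usage: evaluate the loss on stand-in "future audio latents" and draw samples.
x1 = rng.normal(size=(4, LATENT_DIM))
loss = flow_matching_loss(x1)
samples = sample_future_latent(4)
```

In a robot policy, the sampled future latents would be decoded or consumed directly by the action head, giving the policy a lookahead over how the audio signal is expected to evolve.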