
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

November 20, 2025
作者: Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang, Zipeng Xiao, Jiadi Su, You Qiaoben, Pengfei Liu, Zhijie Deng
cs.AI

Abstract

Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervision. However, having the VLA directly predict high-dimensional visual states can dilute model capacity and incur prohibitive training costs, while compressing visual states into more compact supervisory signals inevitably creates an information bottleneck. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities because they neglect language supervision. This paper introduces Mantis, a novel framework featuring Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone by combining meta queries with a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective lets the meta queries automatically capture the latent actions that delineate the visual trajectory, thereby boosting the learning of explicit actions. This disentanglement reduces the burden on the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, after pretraining on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on the LIBERO benchmark after fine-tuning, surpassing strong baselines while converging quickly. Real-world evaluations show that Mantis outperforms π_{0.5}, a leading open-source VLA model, particularly in instruction following, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.
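
The DVF design described in the abstract can be read as follows: the backbone emits hidden states at learnable meta-query positions, a DiT-style head conditioned on those features predicts the next visual state as a residual over the current one, and an action head reads explicit actions from the same meta queries. The sketch below is a minimal, illustrative rendering of that reading, not the released Mantis implementation; every name, shape, loss, and the placeholder Transformer decoder standing in for the DiT head are assumptions.

```python
# Minimal sketch of the Disentangled Visual Foresight (DVF) idea from the abstract.
# All module names, shapes, and losses are illustrative assumptions, not the
# authors' released code; a plain Transformer decoder layer stands in for the
# diffusion Transformer (DiT) head.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangledVisualForesightHead(nn.Module):
    """Predicts the next visual state from meta-query features, with the current
    visual state injected through a residual connection, and reads explicit
    actions off the same meta-query features."""

    def __init__(self, d_model=1024, visual_dim=1024, action_dim=7):
        super().__init__()
        self.to_visual = nn.Linear(d_model, visual_dim)
        # Placeholder for the DiT denoising head: current visual tokens
        # cross-attend to the meta-query features.
        self.foresight_head = nn.TransformerDecoderLayer(
            d_model=visual_dim, nhead=8, batch_first=True
        )
        self.action_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, action_dim)
        )

    def forward(self, meta_query_feat, current_visual_state):
        # meta_query_feat: (B, Q, d_model) backbone hidden states at the learnable
        #   meta-query positions (the VLA backbone itself is not modeled here).
        # current_visual_state: (B, T, visual_dim) visual tokens of the current frame.
        cond = self.to_visual(meta_query_feat)
        # Residual connection: the head predicts a change over the current state,
        # so the meta queries only need to encode the latent action ("what changes").
        next_state = current_visual_state + self.foresight_head(current_visual_state, cond)
        actions = self.action_head(meta_query_feat.mean(dim=1))
        return next_state, actions


# Usage sketch: joint next-state prediction and explicit action supervision.
head = DisentangledVisualForesightHead()
meta_feat = torch.randn(2, 16, 1024)   # meta-query hidden states from the backbone
cur_state = torch.randn(2, 64, 1024)   # current visual tokens
next_gt = torch.randn(2, 64, 1024)     # next-frame visual tokens (ground truth)
action_gt = torch.randn(2, 7)          # ground-truth action targets

next_pred, action_pred = head(meta_feat, cur_state)
loss = F.mse_loss(next_pred, next_gt) + F.mse_loss(action_pred, action_gt)
loss.backward()
```

Because the foresight prediction lives entirely in this separate head, the backbone's capacity and language supervision are left untouched, which is the disentanglement the abstract credits for preserving comprehension and reasoning.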