

FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

February 19, 2026
作者: Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, Donglin Wang
cs.AI

Abstract

Enabling vision-language-action (VLA) models to predict environmental dynamics, known as world modeling, has been recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main issues: (1) the training objective forces models to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, we introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). Our method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, we expand the computational workload in parallel and align the representations simultaneously with multiple distinct visual foundation models. By significantly improving fine-tuning efficiency and reducing dependence on action-annotated data, FRAPPE provides a scalable and data-efficient pathway to enhancing world awareness in generalist robotic policies. Experiments on the RoboTwin benchmark and on real-world tasks demonstrate that FRAPPE outperforms state-of-the-art approaches and generalizes strongly to long-horizon and unseen scenarios.
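The multi-teacher alignment objective described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the cosine-similarity loss, the per-teacher prediction heads, and the uniform teacher weighting are all assumptions; the function and argument names are hypothetical.

```python
import numpy as np

def cosine_alignment_loss(pred, target):
    """Mean (1 - cosine similarity) between predicted and target feature rows."""
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=-1)))

def multi_teacher_alignment_loss(predicted_futures, teacher_embeddings, weights=None):
    """Weighted sum of alignment losses against several frozen vision 'teachers'.

    predicted_futures:  dict teacher_name -> (batch, dim) array, the policy's
                        predicted future latents from a per-teacher head
    teacher_embeddings: dict teacher_name -> (batch, dim) array, the frozen
                        teacher's embedding of the actual future observation
    weights:            optional dict teacher_name -> float; defaults to uniform
    """
    names = list(teacher_embeddings)
    if weights is None:
        weights = {n: 1.0 / len(names) for n in names}
    return sum(
        weights[n] * cosine_alignment_loss(predicted_futures[n], teacher_embeddings[n])
        for n in names
    )
```

The key property is that the policy never has to reconstruct pixels: it only matches each teacher's latent of the true future frame, so no predicted image is fed back during inference.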