F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
September 8, 2025
Authors: Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, Jiangmiao Pang
cs.AI
Abstract
Executing language-conditioned tasks in dynamic visual environments remains a
central challenge in embodied AI. Existing Vision-Language-Action (VLA) models
predominantly adopt reactive state-to-action mappings, often leading to
short-sighted behaviors and poor robustness in dynamic scenes. In this paper,
we introduce F1, a pretrained VLA framework that integrates visual
foresight generation into the decision-making pipeline. F1 adopts a
Mixture-of-Transformer architecture with dedicated modules for perception,
foresight generation, and control, thereby bridging understanding, generation,
and actions. At its core, F1 employs a next-scale prediction mechanism to
synthesize goal-conditioned visual foresight as explicit planning targets. By
forecasting plausible future visual states, F1 reformulates action generation
as a foresight-guided inverse dynamics problem, enabling actions that
implicitly achieve visual goals. To endow F1 with robust and generalizable
capabilities, we propose a three-stage training recipe on an extensive dataset
comprising over 330k trajectories across 136 diverse tasks. This training
scheme enhances modular reasoning and equips the model with transferable visual
foresight, which is critical for complex and dynamic environments. Extensive
evaluations on real-world tasks and simulation benchmarks demonstrate that F1
consistently outperforms existing approaches, achieving substantial gains in
both task success rate and generalization ability.
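
The abstract's reformulation of action generation as a foresight-guided inverse dynamics problem can be illustrated with a minimal sketch. Everything below is a toy stand-in: the three "modules" are plain linear maps with random weights, not the paper's perception, foresight-generation, or control Transformers, and the dimensions are arbitrary. The point is only the data flow: first synthesize a goal-conditioned future visual state, then infer the action that would realize it.

```python
import numpy as np

# Toy sketch of F1-style foresight-guided inverse dynamics.
# All weights and dimensions are hypothetical placeholders,
# not the architecture described in the paper.

rng = np.random.default_rng(0)

OBS_DIM, GOAL_DIM, ACT_DIM = 16, 8, 4

# Placeholder parameters standing in for the foresight and control modules.
W_foresight = rng.standard_normal((OBS_DIM + GOAL_DIM, OBS_DIM)) * 0.1
W_inverse = rng.standard_normal((2 * OBS_DIM, ACT_DIM)) * 0.1

def predict_foresight(obs, goal):
    """Foresight step: synthesize a plausible future visual state
    conditioned on the current observation and the (language) goal."""
    x = np.concatenate([obs, goal])
    return np.tanh(x @ W_foresight)

def inverse_dynamics(obs, foresight):
    """Control step: recover the action that moves the current
    observation toward the predicted future state."""
    x = np.concatenate([obs, foresight])
    return np.tanh(x @ W_inverse)

def act(obs, goal):
    # Action generation as a foresight-guided inverse dynamics
    # problem: the predicted future state serves as an explicit
    # planning target that the action implicitly achieves.
    foresight = predict_foresight(obs, goal)
    return inverse_dynamics(obs, foresight)

obs = rng.standard_normal(OBS_DIM)
goal = rng.standard_normal(GOAL_DIM)
action = act(obs, goal)
print(action.shape)  # (4,)
```

In contrast, a reactive state-to-action mapping (the baseline behavior the abstract critiques) would map `obs` and `goal` directly to `action` with no intermediate future-state prediction.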