F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
September 8, 2025
Authors: Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, Jiangmiao Pang
cs.AI
Abstract
Executing language-conditioned tasks in dynamic visual environments remains a
central challenge in embodied AI. Existing Vision-Language-Action (VLA) models
predominantly adopt reactive state-to-action mappings, often leading to
short-sighted behaviors and poor robustness in dynamic scenes. In this paper,
we introduce F1, a pretrained VLA framework that integrates visual
foresight generation into the decision-making pipeline. F1 adopts a
Mixture-of-Transformer architecture with dedicated modules for perception,
foresight generation, and control, thereby bridging understanding, generation,
and actions. At its core, F1 employs a next-scale prediction mechanism to
synthesize goal-conditioned visual foresight as explicit planning targets. By
forecasting plausible future visual states, F1 reformulates action generation
as a foresight-guided inverse dynamics problem, enabling actions that
implicitly achieve visual goals. To endow F1 with robust and generalizable
capabilities, we propose a three-stage training recipe on an extensive dataset
comprising over 330k trajectories across 136 diverse tasks. This training
scheme enhances modular reasoning and equips the model with transferable visual
foresight, which is critical for complex and dynamic environments. Extensive
evaluations on real-world tasks and simulation benchmarks demonstrate that F1
consistently outperforms existing approaches, achieving substantial gains in
both task success rate and generalization ability.
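To make the foresight-guided inverse-dynamics idea concrete, here is a minimal toy sketch, not the F1 implementation: all function names, the additive dynamics, and the low-dimensional "visual state" are illustrative assumptions. A foresight module predicts a goal-conditioned future state, and an inverse-dynamics head recovers the action that would realize it.

```python
# Toy sketch of foresight-guided inverse dynamics (illustrative only;
# F1 operates on visual tokens, not low-dimensional vectors).
import numpy as np

rng = np.random.default_rng(0)

def foresight(state: np.ndarray, goal: np.ndarray) -> np.ndarray:
    """Goal-conditioned foresight: predict a plausible next state that
    moves toward the goal (stands in for next-scale visual prediction)."""
    return state + 0.5 * (goal - state)

def inverse_dynamics(state: np.ndarray, future: np.ndarray) -> np.ndarray:
    """Inverse dynamics: the action needed to reach the predicted future
    state, assuming toy additive dynamics s' = s + a."""
    return future - state

def step(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Matching toy forward dynamics of the environment."""
    return state + action

state = rng.normal(size=3)   # current "visual" state
goal = rng.normal(size=3)    # language-specified goal state
pred = foresight(state, goal)            # explicit planning target
action = inverse_dynamics(state, pred)   # action conditioned on foresight
next_state = step(state, action)

# Executing the action implicitly achieves the predicted visual state.
assert np.allclose(next_state, pred)
```

The key structural point mirrored here is that the policy never maps state to action directly; the action is derived from a predicted future, which is what the abstract means by "foresight-guided inverse dynamics."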