F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Abstract

Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, which often lead to short-sighted behavior and poor robustness in dynamic scenes. In this paper, we introduce F1, a pretrained VLA framework that integrates visual foresight generation into the decision-making pipeline. F1 adopts a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and action. At its core, F1 employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight as explicit planning targets. By forecasting plausible future visual states, F1 reformulates action generation as a foresight-guided inverse dynamics problem, enabling actions that implicitly achieve visual goals. To endow F1 with robust and generalizable capabilities, we propose a three-stage training recipe on an extensive dataset comprising over 330k trajectories across 136 diverse tasks. This training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical in complex and dynamic environments. Extensive evaluations on real-world tasks and simulation benchmarks demonstrate that F1 consistently outperforms existing approaches, achieving substantial gains in both task success rate and generalization ability.
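The contrast between a reactive state-to-action mapping and a foresight-guided inverse dynamics formulation can be illustrated with a toy sketch. This is not F1's implementation: the 2D point state, the additive dynamics s' = s + a, and the function names `predict_foresight`, `inverse_dynamics`, `step`, and `rollout` are all assumptions made purely for illustration. A hypothetical foresight function first imagines a plausible next state on the way to the goal, and the action is then recovered as whatever moves the current state to that imagined state.

```python
from typing import List, Tuple

# Toy 2D state standing in for a visual observation (an assumption;
# F1 operates on images, not on point coordinates).
State = Tuple[float, float]


def predict_foresight(state: State, goal: State, step_fraction: float = 0.5) -> State:
    """Hypothetical foresight model: imagine a next state part-way to the goal."""
    return (
        state[0] + step_fraction * (goal[0] - state[0]),
        state[1] + step_fraction * (goal[1] - state[1]),
    )


def inverse_dynamics(state: State, target: State) -> State:
    """Toy inverse dynamics: under s' = s + a, the action reaching target is target - s."""
    return (target[0] - state[0], target[1] - state[1])


def step(state: State, action: State) -> State:
    """Assumed additive environment dynamics: s' = s + a."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(start: State, goal: State, num_steps: int = 5) -> List[State]:
    """Alternate between imagining a future state and acting to reach it."""
    state = start
    trajectory = [state]
    for _ in range(num_steps):
        foresight = predict_foresight(state, goal)    # explicit planning target
        action = inverse_dynamics(state, foresight)   # action that realizes the target
        state = step(state, action)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    for s in rollout((0.0, 0.0), (8.0, 4.0), num_steps=3):
        print(s)
```

In this sketch the action is never predicted directly from the current state; it is derived from the gap between the current state and the imagined future state, which is the sense in which the predicted future acts as an explicit planning target.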
