F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Abstract

Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, which often leads to short-sighted behavior and poor robustness in dynamic scenes. In this paper, we introduce F1, a pretrained VLA framework that integrates visual foresight generation into the decision-making pipeline. F1 adopts a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and actions. At its core, F1 employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight as explicit planning targets. By forecasting plausible future visual states, F1 reformulates action generation as a foresight-guided inverse dynamics problem, enabling actions that implicitly achieve visual goals. To endow F1 with robust and generalizable capabilities, we propose a three-stage training recipe on an extensive dataset comprising over 330k trajectories across 136 diverse tasks. This training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical for complex and dynamic environments. Extensive evaluations on real-world tasks and simulation benchmarks demonstrate that F1 consistently outperforms existing approaches, achieving substantial gains in both task success rate and generalization ability.
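
The contrast between a reactive state-to-action mapping and foresight-guided inverse dynamics can be written compactly. The notation below is an illustrative assumption and is not taken from the paper: $o_t$ is the current observation, $\ell$ the language instruction, $\hat{o}_{t+k}$ a predicted future visual state, $a_t$ the action, and $\pi$, $G$, $\pi_{\mathrm{inv}}$ are hypothetical names for a reactive policy, the foresight generator, and the inverse-dynamics-style action module.

```latex
% Reactive state-to-action mapping: the action depends only on the present.
a_t \sim \pi(a_t \mid o_t, \ell)

% Foresight-guided inverse dynamics (sketch): first predict a plausible
% future visual state, then choose the action that moves toward it.
\hat{o}_{t+k} \sim G(\hat{o}_{t+k} \mid o_t, \ell), \qquad
a_t \sim \pi_{\mathrm{inv}}(a_t \mid o_t, \hat{o}_{t+k}, \ell)
```

Under this reading, the predicted state $\hat{o}_{t+k}$ plays the role of the explicit planning target mentioned above, and the action module only has to answer which action connects the current observation to that target.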
