ATLAS: Agentic or Latent Visual Reasoning? One Word Is Enough for Both

Abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising research direction. A straightforward approach is to generate images directly with unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods generalize poorly across tasks and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete "word", termed a functional token, serves both as an agentic operation and as a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet it requires no visual supervision and remains a standard token in the tokenizer vocabulary that can be generated via next-token prediction. This design avoids verbose generation of intermediate visual content while preserving compatibility with standard, scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm that inspires future visual reasoning research.
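
The abstract does not give the LA-GRPO objective, so the following is only an illustrative sketch of the general idea it describes: a GRPO-style policy-gradient term over all tokens, plus an auxiliary term restricted to functional-token positions and scaled by a fixed (static) weight. The function names, the `anchor_weight` parameter, and the exact form of the auxiliary term are assumptions made for illustration and should not be read as the authors' formulation. Clipping and KL regularization, which GRPO normally includes, are omitted for brevity.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchored_loss(logprobs, is_functional, advantage, anchor_weight=0.5):
    """Hypothetical anchored loss for a single sampled response.

    logprobs:      per-token log-probabilities under the current policy
    is_functional: per-token flags, True where the token is a functional token
    advantage:     group-relative advantage of this response
    anchor_weight: fixed weight of the auxiliary term (assumed, not from the paper)
    """
    # Policy-gradient term averaged over every token in the response.
    pg_term = -advantage * sum(logprobs) / len(logprobs)

    # Auxiliary term averaged over functional-token positions only, so the
    # few functional tokens are not diluted by the many ordinary tokens.
    func = [lp for lp, flag in zip(logprobs, is_functional) if flag]
    aux_term = -advantage * sum(func) / len(func) if func else 0.0

    return pg_term + anchor_weight * aux_term
```

In this sketch, a response with a single functional token among many ordinary tokens still receives a gradient contribution on that token that does not shrink with response length, which is one plausible reading of "stronger gradient updates" for sparse functional tokens.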