ATLAS: Agentic or Latent Visual Reasoning? One Word Is Enough for Both

Abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising research direction. A straightforward approach is to generate images directly with unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods generalize poorly across tasks and are difficult to train with autoregressive parallelization. To combine the strengths of both while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed a functional token, serves both as an agentic operation and as a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet it requires no visual supervision and remains a standard token in the tokenizer vocabulary, so it can be generated via ordinary next-token prediction. This design avoids verbose generation of intermediate visual content while preserving compatibility with standard, scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes training by anchoring functional tokens with a statically weighted auxiliary objective that provides stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm that inspires future research in visual reasoning.
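
The abstract does not give the form of the LA-GRPO objective, so the following Python sketch is only one plausible reading of "anchoring functional tokens with a statically weighted auxiliary objective". Everything specific in it is an assumption for illustration: the functional token names, the helper names, the anchor weight of 0.5, and the choice to average the auxiliary term over functional-token positions only. Clipping and KL regularization, which GRPO normally includes, are omitted for brevity.

```python
import math

# Hypothetical functional tokens; the abstract does not list the real ones.
FUNCTIONAL_TOKENS = {"<crop>", "<zoom>", "<mark>"}


def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-6) for r in rewards]


def anchored_loss(tokens, logprobs, advantage, anchor_weight=0.5):
    """Policy-gradient loss for one response plus an assumed anchor term.

    tokens:   sampled token strings of the response
    logprobs: log-probability the policy assigned to each sampled token
    """
    # Base term: averaged over all tokens, so a rare functional token
    # only receives a 1/len(tokens) share of the gradient.
    base = -advantage * sum(logprobs) / len(tokens)

    func_lp = [lp for t, lp in zip(tokens, logprobs) if t in FUNCTIONAL_TOKENS]
    if not func_lp:
        return base

    # Assumed auxiliary term: the same signal, averaged over functional
    # tokens only and scaled by a fixed (static) weight.
    anchor = -advantage * sum(func_lp) / len(func_lp)
    return base + anchor_weight * anchor
```

Under these assumptions, a response with one functional token among twenty tokens gives that token a gradient coefficient of 1/20 + 0.5 instead of 1/20, which is the kind of stronger update for sparse tokens that the abstract describes.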