ATLAS: Agentic or Latent Visual Reasoning? One Word for Both

Abstract

Visual reasoning that interleaves text with intermediate visual states has emerged as a promising direction. A straightforward approach is to generate images directly with a unified model during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods generalize poorly across tasks and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete "word", termed a functional token, serves both as an agentic operation and as a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet it requires no visual supervision and remains a standard token in the tokenizer vocabulary that can be generated via next-token prediction. This design avoids generating verbose intermediate visual content while preserving compatibility with standard, scalable SFT and RL training, without architectural or methodological modifications. To address the sparsity of functional tokens during RL, we further introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes training by anchoring functional tokens with a statically weighted auxiliary objective that provides stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm that inspires future research on visual reasoning.
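The abstract describes LA-GRPO only at a high level, so the following is a minimal sketch of one way a statically weighted auxiliary objective could be attached to a GRPO-style loss. Everything specific here is an assumption for illustration and is not taken from the paper: the token names in FUNCTIONAL_TOKENS, the function names, the anchor_weight value, and the choice to anchor only functional tokens in samples with positive advantage. The sketch also omits the clipped importance ratio and KL term that GRPO normally includes.

```python
import math

# Hypothetical functional-token names; the source does not list the real vocabulary.
FUNCTIONAL_TOKENS = {"<zoom>", "<crop>"}


def group_advantages(rewards):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-6) for r in rewards]


def la_grpo_loss(group, anchor_weight=0.1):
    """Return (total, policy, anchor) losses for one group of sampled responses.

    Each sample is a dict with 'tokens', 'logprobs' (one per token), and 'reward'.
    anchor_weight is a fixed (static) coefficient; its value here is an assumption.
    """
    advantages = group_advantages([s["reward"] for s in group])
    policy_terms, anchor_terms = [], []
    for sample, adv in zip(group, advantages):
        for tok, logprob in zip(sample["tokens"], sample["logprobs"]):
            # Simplified policy-gradient term (no clipping, no KL penalty).
            policy_terms.append(-adv * logprob)
            # Assumed anchor: extra likelihood term on the sparse functional
            # tokens, applied only in better-than-average samples.
            if tok in FUNCTIONAL_TOKENS and adv > 0:
                anchor_terms.append(-logprob)
    policy_loss = sum(policy_terms) / len(policy_terms)
    anchor_loss = sum(anchor_terms) / len(anchor_terms) if anchor_terms else 0.0
    total = policy_loss + anchor_weight * anchor_loss
    return total, policy_loss, anchor_loss


# Toy group: one rewarded response that uses a functional token, one that does not.
example_group = [
    {"tokens": ["<zoom>", "A"], "logprobs": [-2.0, -0.5], "reward": 1.0},
    {"tokens": ["B"], "logprobs": [-1.0], "reward": 0.0},
]
total, policy, anchor = la_grpo_loss(example_group)
print(total, policy, anchor)
```

Because functional tokens occupy only a few positions per response, their contribution to the averaged policy term is small; a separate term with a fixed weight is one plausible way to give those positions a stronger gradient, which is the effect the abstract attributes to LA-GRPO.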