ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
May 14, 2026
Authors: Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng
cs.AI
Abstract
Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to generate images directly with unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed a functional token, serves as both an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet it requires no visual supervision and remains a standard token in the tokenizer vocabulary that can be generated via next-token prediction. This design avoids verbose intermediate visual content generation while preserving compatibility with vanilla, scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes training by anchoring functional tokens with a statically weighted auxiliary objective that provides stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm that inspires future research on visual reasoning.
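The abstract does not give the LA-GRPO formula, so the following is only a minimal sketch of one possible reading: a GRPO-style surrogate loss with group-normalized advantages, plus an auxiliary term computed only over functional-token positions and scaled by a fixed weight. The token names (`<zoom>`, `<mark>`), the function names, and the `aux_weight` value are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical functional tokens; the abstract does not list the actual ones.
FUNCTIONAL_TOKENS = {"<zoom>", "<mark>"}


def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each reward within its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def anchored_loss(group, rewards, aux_weight=0.5):
    """Policy-gradient surrogate plus a statically weighted functional-token term.

    group: one list per sampled response, each a list of (token, log_prob) pairs.
    aux_weight: fixed (static) weight of the auxiliary term; the value is an
        assumption made for illustration.
    """
    advantages = group_relative_advantages(rewards)
    main_terms, aux_terms = [], []
    for response, adv in zip(group, advantages):
        for token, log_prob in response:
            term = -adv * log_prob
            main_terms.append(term)
            # Functional tokens are sparse, so they get an extra averaged term.
            if token in FUNCTIONAL_TOKENS:
                aux_terms.append(term)
    main = sum(main_terms) / len(main_terms)
    aux = sum(aux_terms) / len(aux_terms) if aux_terms else 0.0
    return main + aux_weight * aux


if __name__ == "__main__":
    group = [
        [("<zoom>", -0.5), ("cat", -1.0)],  # response with a functional token
        [("dog", -2.0)],                    # response without one
    ]
    print(anchored_loss(group, rewards=[1.0, 0.0]))
```

Setting `aux_weight=0.0` reduces the sketch to the plain group-relative surrogate. Because the auxiliary term is averaged over the functional-token positions alone, those few positions contribute a larger share of the gradient than they would inside the full-sequence average, which is the intuition the abstract attributes to anchoring.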