ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough

Abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to generate images directly with unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods generalize poorly across tasks and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed a functional token, serves both as an agentic operation and as a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet it requires no visual supervision and remains a standard token in the tokenizer vocabulary, so it can be generated via ordinary next-token prediction. This design avoids generating verbose intermediate visual content while preserving compatibility with vanilla, scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm that inspires future visual reasoning research.
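The two mechanisms described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the ATLAS implementation: the vocabulary, the token names (`<crop>`, `<zoom>`, `<rotate>`), the helper names, and the exact form of the auxiliary term are all hypothetical, since the abstract does not specify the LA-GRPO objective. The sketch assumes the auxiliary term is a policy-gradient-style term restricted to functional-token positions and scaled by a fixed weight.

```python
# Hypothetical toy vocabulary: functional tokens are ordinary entries
# with ordinary ids, so next-token prediction can emit them like any word.
VOCAB = ["<pad>", "the", "object", "is", "left", "<crop>", "<zoom>", "<rotate>"]
FUNCTIONAL_TOKENS = {"<crop>", "<zoom>", "<rotate>"}  # illustrative names only
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
FUNCTIONAL_IDS = {TOKEN_TO_ID[tok] for tok in FUNCTIONAL_TOKENS}


def functional_mask(token_ids):
    """Return 1.0 at positions holding a functional token, else 0.0."""
    return [1.0 if t in FUNCTIONAL_IDS else 0.0 for t in token_ids]


def anchored_loss(token_ids, logprobs, advantage, aux_weight=0.5):
    """Toy sequence loss with a statically weighted anchor term.

    Assumed form (not taken from the paper):
      policy term = -advantage * mean(logprob over all tokens)
      anchor term = -advantage * mean(logprob over functional tokens)
      loss        = policy term + aux_weight * anchor term
    Averaging the anchor term over functional positions only keeps the
    few functional tokens from being diluted by the many ordinary ones.
    """
    if len(token_ids) != len(logprobs) or not token_ids:
        raise ValueError("token_ids and logprobs must be non-empty and aligned")

    policy_term = -advantage * sum(logprobs) / len(logprobs)

    mask = functional_mask(token_ids)
    n_functional = sum(mask)
    if n_functional == 0:
        return policy_term  # nothing to anchor in this sequence

    anchored_sum = sum(m * lp for m, lp in zip(mask, logprobs))
    anchor_term = -advantage * anchored_sum / n_functional
    return policy_term + aux_weight * anchor_term


# Usage: "the object <crop> is left" with per-token log-probabilities.
ids = [TOKEN_TO_ID[t] for t in ["the", "object", "<crop>", "is", "left"]]
loss = anchored_loss(ids, [-1.0, -1.0, -2.0, -1.0, -1.0], advantage=1.0)
```

With `aux_weight` held constant, the single functional token in this sequence contributes to the loss both through the ordinary sequence average and through the anchor term, which is one simple way a sparse token could receive a stronger update than its frequency alone would give it.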