CanViT: Toward Active-Vision Foundation Models

Abstract

Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but it lacks scalable general-purpose architectures and pretraining pipelines. As a result, Active-Vision Foundation Models (AVFMs) have remained unexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone to a spatiotopic, scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) from memory (canvas-level), eliminating canvas-side self-attention and fully connected layers to achieve low-latency sequential inference and scalability to large scenes. We propose a label-free active-vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from a random initialization on 13.2 million ImageNet-21k scenes (an order of magnitude more than previous active models) and 1 billion random glimpses, in 166 hours on a single H100. On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU from a single low-resolution glimpse, outperforming the best prior active model's 27.6% with 19.5x fewer inference FLOPs and no fine-tuning; it also outperforms its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B reaches 81.2% top-1 accuracy with frozen teacher probes. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work closes the wide gap between passive and active vision on semantic segmentation and demonstrates the potential of AVFMs as a new research axis.
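The abstract does not spell out the internals of Canvas Attention, so the following is only a minimal NumPy sketch of one way an asymmetric read/write cross-attention between a small set of glimpse tokens and a large canvas could look. It omits scene-relative RoPE, multi-head structure, and normalization. All names (`cross_attend`, `canvas_step`, `make_params`) and the single-head layout are assumptions made for illustration, not the authors' implementation. The point it illustrates is the stated design choice: the canvas is touched only through cross-attention with the glimpse tokens, with no canvas-side self-attention or fully connected layers.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` attend over `context`."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def make_params(dim, rng):
    """Hypothetical helper: random projections for the read and write paths."""
    def proj():
        return rng.normal(size=(dim, dim)) / np.sqrt(dim)
    return {
        "read": (proj(), proj(), proj()),
        "write": (proj(), proj(), proj()),
    }


def canvas_step(glimpse_tokens, canvas, params):
    """One read/write exchange between backbone tokens and the canvas.

    Assumption: the canvas receives only a residual cross-attention update.
    No self-attention or MLP is applied on the canvas side, so the cost is
    proportional to (canvas tokens x glimpse tokens) rather than quadratic
    in the number of canvas tokens.
    """
    # Read: glimpse tokens query the scene-wide canvas.
    glimpse_tokens = glimpse_tokens + cross_attend(
        glimpse_tokens, canvas, *params["read"]
    )
    # Write: canvas tokens query the updated glimpse tokens.
    canvas = canvas + cross_attend(canvas, glimpse_tokens, *params["write"])
    return glimpse_tokens, canvas
```

In this sketch a sequence of glimpses would be processed by calling `canvas_step` once per glimpse and carrying `canvas` forward as the working memory; how CanViT actually interleaves this with backbone layers is not described in the abstract.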
