CanViT: Toward Active-Vision Foundation Models

Abstract

Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines. As a result, Active-Vision Foundation Models (AVFMs) have remained unexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully connected layers to achieve low-latency sequential inference and scalability to large scenes. We propose a label-free active vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from a random initialization on 13.2 million ImageNet-21k scenes (an order of magnitude more than previous active models) and 1 billion random glimpses, in 166 hours on a single H100. On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU in a single low-resolution glimpse, outperforming the best active model's 27.6% with 19.5x fewer inference FLOPs and no fine-tuning, as well as its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B reaches 81.2% top-1 accuracy with frozen teacher probes. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work closes the wide gap between passive and active vision on semantic segmentation and demonstrates the potential of AVFMs as a new research axis.
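
The scalability argument above (dropping canvas-side self-attention so that cost grows gently with scene size) can be illustrated with a rough count of query-key score entries per attention layer. This is only a back-of-the-envelope sketch, not the CanViT implementation: the function names, the 64-token glimpse, and the canvas grid sizes are all hypothetical values chosen for illustration, and projections, heads, and feature dimensions are ignored.

```python
def attention_pairs(num_queries: int, num_keys: int) -> int:
    """Number of query-key score entries a single attention layer computes."""
    return num_queries * num_keys


def compare(glimpse_tokens: int, canvas_side: int) -> dict:
    """Compare canvas self-attention with glimpse<->canvas cross-attention.

    glimpse_tokens and canvas_side are hypothetical sizes, not values
    taken from the paper.
    """
    canvas_tokens = canvas_side * canvas_side
    return {
        "canvas_tokens": canvas_tokens,
        # Self-attention over the canvas: quadratic in the canvas size.
        "canvas_self_attention": attention_pairs(canvas_tokens, canvas_tokens),
        # Cross-attention between glimpse and canvas tokens: linear in the
        # canvas size (the count is the same in either direction).
        "cross_attention": attention_pairs(glimpse_tokens, canvas_tokens),
    }


if __name__ == "__main__":
    for side in (16, 32, 64):
        print(compare(glimpse_tokens=64, canvas_side=side))
```

Under these assumed sizes, doubling the canvas side multiplies the self-attention count by 16 but the cross-attention count by only 4, which is the kind of gap the abstract's design choice targets.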
