CanViT: Toward Active-Vision Foundation Models
March 23, 2026
Authors: Yohaï-Eliel Berreby, Sabrina Du, Audrey Durand, B. Suresh Krishna
cs.AI
Abstract
Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but it has lacked scalable, general-purpose architectures and pretraining pipelines. As a result, Active-Vision Foundation Models (AVFMs) have remained unexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone to a spatiotopic, scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve low-latency sequential inference and scalability to large scenes. We propose a label-free active-vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from random initialization on 13.2 million ImageNet-21k scenes -- an order of magnitude more than previous active models -- and 1 billion random glimpses, in 166 hours on a single H100. On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU from a single low-resolution glimpse, outperforming the best active model's 27.6% with 19.5x fewer inference FLOPs and no fine-tuning, and also outperforming its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B reaches 81.2% top-1 accuracy with frozen teacher probes. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work closes the wide gap between passive and active vision on semantic segmentation and demonstrates the potential of AVFMs as a new research axis.
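
To make the abstract's description of Canvas Attention concrete, here is a minimal PyTorch sketch of one plausible reading of the mechanism: a read/write pair of cross-attentions between glimpse tokens and canvas slots. The module name, the read/write split, head counts, and residual wiring are illustrative assumptions, not the authors' implementation, and scene-relative RoPE is only indicated in comments rather than implemented.

```python
# A hypothetical sketch of the asymmetric cross-attention pattern the
# abstract calls "Canvas Attention". All names and design details here are
# assumptions for illustration; only the constraint that the canvas has no
# self-attention and no fully-connected layers comes from the abstract.
import torch
import torch.nn as nn

class CanvasAttentionSketch(nn.Module):
    """Asymmetric interaction between a retinotopic backbone and a
    spatiotopic canvas: the canvas is read from and written to via
    cross-attention only, with no canvas-side self-attention or MLP."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, glimpse_tokens: torch.Tensor, canvas_tokens: torch.Tensor):
        # READ: backbone (glimpse) tokens query the canvas. Scene-relative
        # RoPE would rotate queries/keys by scene coordinates here, so both
        # streams share one spatial frame of reference.
        read_out, _ = self.read(glimpse_tokens, canvas_tokens, canvas_tokens)
        glimpse_tokens = glimpse_tokens + read_out

        # WRITE: canvas slots query the processed glimpse tokens. Per the
        # abstract, this is the only update the canvas receives, which keeps
        # per-glimpse cost linear in canvas size rather than quadratic.
        write_out, _ = self.write(canvas_tokens, glimpse_tokens, glimpse_tokens)
        canvas_tokens = canvas_tokens + write_out
        return glimpse_tokens, canvas_tokens
```

Dropping canvas-side self-attention is what lets the workspace scale to large scenes: each sequential inference step touches the canvas only through these two cross-attention calls.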
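The pretraining recipe (policy-agnostic passive-to-active dense latent distillation) can likewise be sketched as a training step. The loss form, the glimpse sampler, and every interface on `student` below are hypothetical; only the overall recipe -- randomized low-resolution glimpse sequences in, the passive teacher's dense scene-wide embeddings as the regression target -- is stated in the abstract.

```python
# A hedged sketch of label-free passive-to-active dense latent distillation.
# student.init_canvas / step / decode, crop_fn, and the smooth-L1 loss are
# assumptions for illustration, not the authors' API or objective.
import random
import torch
import torch.nn.functional as F

def random_glimpse_sequence(max_len: int = 8):
    """Sample a rollout with randomized length, locations, and zoom levels,
    so pretraining commits to no particular glimpse policy."""
    length = random.randint(1, max_len)
    return [
        {
            "center": (random.random(), random.random()),  # normalized (x, y)
            "zoom": random.choice([1.0, 2.0, 4.0]),        # magnification level
        }
        for _ in range(length)
    ]

def distillation_step(student, teacher, scene, crop_fn, optimizer):
    # Dense target: the passive teacher (e.g. DINOv3) sees the full scene
    # once; its patch embeddings define the regression target.
    with torch.no_grad():
        target = teacher(scene)                 # (B, H*W, D) dense features

    # Active student: sees only low-resolution crops, accumulated in the
    # canvas one glimpse at a time.
    canvas = student.init_canvas(scene.shape[0])
    for g in random_glimpse_sequence():
        glimpse = crop_fn(scene, g["center"], g["zoom"])  # low-res crop
        canvas = student.step(glimpse, g, canvas)

    # Decode the canvas into scene-wide features and regress onto the
    # teacher's embeddings (loss choice here is an assumption).
    pred = student.decode(canvas)               # (B, H*W, D)
    loss = F.smooth_l1_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because locations, zooms, and sequence lengths are resampled every step, nothing in this objective ties the student to a fixed policy, which is what allows the frozen model to generalize to longer rollouts, larger scenes, and new policies at evaluation time.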