

In Pursuit of Pixel Supervision for Visual Pre-training

December 17, 2025
Authors: Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu
cs.AI

Abstract

At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images using a self-curation strategy that requires minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
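The masked-autoencoder objective the abstract refers to can be illustrated with a minimal numpy sketch: hide a large fraction of image patches, predict the hidden ones, and score the reconstruction only on the masked positions. The patch sizes, the 75% masking ratio, and the trivial stand-in "decoder" below are illustrative assumptions for exposition, not Pixio's actual architecture or training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (masked, visible) sets."""
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    return perm[:num_masked], perm[num_masked:]

# Toy "image": 16 patches, each flattened to 8 pixel values.
patches = rng.standard_normal((16, 8))

# MAE-style masking: hide 75% of the patches from the encoder.
masked_idx, visible_idx = random_mask(len(patches), 0.75, rng)

# A real MAE encodes only the visible patches and decodes all positions;
# here the mean of the visible patches stands in for the decoder's output.
prediction = np.tile(patches[visible_idx].mean(axis=0), (len(masked_idx), 1))

# Reconstruction loss is computed only on the masked patches.
loss = float(np.mean((prediction - patches[masked_idx]) ** 2))
```

With a 75% ratio, only 4 of the 16 patches reach the encoder, which is what makes the pre-training task hard and the encoder cheap to run at scale.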