In Pursuit of Pixel Supervision for Visual Pre-training
December 17, 2025
Authors: Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu
cs.AI
Abstract
At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images using a self-curation strategy that requires minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
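To make the masked-autoencoder paradigm referenced above concrete, the following is a minimal, generic sketch of MAE-style pixel reconstruction: most patches are masked, only the visible patches are encoded, and the loss is computed on the raw pixels of the masked patches. This is not Pixio's actual implementation; the class name, dimensions, 75% mask ratio, and the omission of positional embeddings are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Generic MAE-style sketch (not Pixio): mask patches, encode the visible
    ones, and reconstruct raw pixels at the masked positions.
    Positional embeddings and transformer blocks are omitted for brevity."""

    def __init__(self, patch_dim=768, embed_dim=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.Sequential(
            nn.Linear(patch_dim, embed_dim), nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.decoder = nn.Linear(embed_dim, patch_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, patches):  # patches: (B, N, patch_dim) of flattened pixel patches
        B, N, D = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N).argsort(dim=1)          # random per-image patch permutation
        keep, masked = idx[:, :n_keep], idx[:, n_keep:]

        # Encode only the visible (unmasked) patches
        visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)

        # Scatter encoded tokens back; masked positions get a learned mask token
        full = self.mask_token.expand(B, N, -1).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, latent.size(-1)), latent)
        recon = self.decoder(full)                     # predict raw pixel values per patch

        # Reconstruction loss only on masked patches
        tgt = torch.gather(patches, 1, masked.unsqueeze(-1).expand(-1, -1, D))
        pred = torch.gather(recon, 1, masked.unsqueeze(-1).expand(-1, -1, D))
        return ((pred - tgt) ** 2).mean()

# Usage sketch: 196 patches of 16x16x3 pixels per image
model = TinyMAE()
loss = model(torch.randn(4, 196, 768))
loss.backward()
```

The key design choice illustrated here is that supervision comes directly from pixels: the decoder regresses raw patch values, so no latent-space teacher or human labels are needed.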