Cluster and Predict Latent Patches for Improved Masked Image Modeling
February 12, 2025
Authors: Timothée Darcet, Federico Baldassarre, Maxime Oquab, Julien Mairal, Piotr Bojanowski
cs.AI
Abstract
Masked Image Modeling (MIM) offers a promising approach to self-supervised
representation learning; however, existing MIM models still lag behind the
state of the art. In this paper, we systematically analyze target
representations, loss functions, and architectures to introduce CAPI, a novel
pure-MIM framework that relies on the prediction of latent clusterings. Our
approach leverages a clustering-based loss, which is stable to train, and
exhibits promising scaling properties. Our ViT-L backbone, CAPI, achieves 83.8%
accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes,
substantially outperforming previous MIM methods and approaching the
performance of the current state-of-the-art, DINOv2. We release all our code
and models.