Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
July 18, 2025
Authors: Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano
cs.AI
Abstract
We present Franca (pronounced Fran-ka): free one; the first fully open-source
(data, code, weights) vision foundation model that matches and in many cases
surpasses the performance of state-of-the-art proprietary models such as DINOv2,
CLIP, and SigLIPv2. Our approach is grounded in a transparent training
pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and
a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in
SSL clustering methods. While modern models rely on assigning image features to
large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to
account for the inherent ambiguity in clustering semantics. To address this, we
introduce a parameter-efficient, multi-head clustering projector based on
nested Matryoshka representations. This design progressively refines features
into increasingly fine-grained clusters without increasing the model size,
enabling both performance and memory efficiency. Additionally, we propose a
novel positional disentanglement strategy that explicitly removes positional
biases from dense representations, thereby improving the encoding of semantic
content. This leads to consistent gains on several downstream benchmarks,
demonstrating the utility of cleaner feature spaces. Our contributions
establish a new standard for transparent, high-performance vision models and
open a path toward more reproducible and generalizable foundation models for
the broader AI community. The code and model checkpoints are available at
https://github.com/valeoai/Franca.
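The Sinkhorn-Knopp assignment mentioned above can be illustrated with a minimal numpy sketch (all names and default values here are illustrative, not taken from the Franca codebase): alternating row/column normalization of a similarity matrix yields soft cluster assignments with roughly uniform cluster usage, which is what prevents all features from collapsing onto a single prototype.

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=3, eps=0.05):
    """Balanced soft assignment of N samples to K prototypes.

    `scores` is an (N, K) feature-prototype similarity matrix. Alternating
    row/column normalization pushes the assignment toward uniform cluster
    usage across the batch.
    """
    Q = np.exp(scores / eps)                  # Gibbs kernel
    Q /= Q.sum()                              # joint distribution over (sample, cluster)
    N, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)     # each column sums to 1 ...
        Q /= K                                # ... then to 1/K (uniform cluster usage)
        Q /= Q.sum(axis=1, keepdims=True)     # each row sums to 1 ...
        Q /= N                                # ... then to 1/N (one unit per sample)
    return Q * N                              # rows sum to 1: per-sample soft assignments
```

With enough iterations the column sums approach N/K, i.e. every prototype receives an equal share of the batch, while each row remains a valid assignment distribution for its sample.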
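The abstract does not spell out the internals of the Matryoshka clustering projector, but the nested-prefix idea it refers to can be sketched as follows: head h scores only the first d_h dimensions of each feature against its own prototype bank, so coarse heads pair short prefixes with few clusters and fine heads pair the full vector with many clusters, without widening the backbone. All sizes and names below are hypothetical, chosen for illustration.

```python
import numpy as np

def matryoshka_cluster_logits(features, heads):
    """Multi-head clustering over nested (Matryoshka) prefixes of a feature.

    `heads` is a list of (K_h, d_h) prototype arrays; head h sees only the
    first d_h feature dimensions and returns (N, K_h) cosine similarities.
    """
    logits = []
    for protos in heads:
        _, dim = protos.shape
        f = features[:, :dim]                                # nested prefix slice
        f = f / np.linalg.norm(f, axis=1, keepdims=True)     # unit-normalize features
        p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
        logits.append(f @ p.T)                               # cluster scores per head
    return logits

# Hypothetical configuration: three heads of increasing granularity over a 256-d feature.
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 256))
heads = [rng.normal(size=(K, d)) for K, d in [(16, 64), (64, 128), (256, 256)]]
out = matryoshka_cluster_logits(feats, heads)
print([o.shape for o in out])   # [(10, 16), (10, 64), (10, 256)]
```

The memory saving claimed in the abstract follows from reuse: the heads share one feature vector and differ only in how much of it they read, rather than each owning a full-width projection.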
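For the positional disentanglement idea — removing positional biases from dense patch representations — here is a deliberately generic linear sketch: regress patch features on their grid coordinates and keep only the position-independent residual. The paper's actual training-time strategy may differ; this merely makes the notion of "removing a positional component" concrete.

```python
import numpy as np

def remove_positional_component(patch_features, grid_hw):
    """Subtract the least-squares fit of patch features on grid coordinates.

    `patch_features` is (H*W, D) in row-major patch order; the returned
    residual is uncorrelated with the row/column coordinates by construction.
    """
    H, W = grid_hw
    N, _ = patch_features.shape
    assert N == H * W, "features must come from an H x W patch grid"
    rows, cols = np.divmod(np.arange(N), W)
    # Design matrix: normalized row, normalized column, intercept.
    X = np.stack([rows / max(H - 1, 1), cols / max(W - 1, 1), np.ones(N)], axis=1)
    coef, *_ = np.linalg.lstsq(X, patch_features, rcond=None)  # (3, D) fit
    return patch_features - X @ coef                           # positional part removed
```

By the least-squares normal equations, the residual is orthogonal to every column of the design matrix, so no linear probe on patch position can recover anything from it — a toy analogue of the "cleaner feature space" the abstract describes.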