Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
July 18, 2025
Authors: Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano
cs.AI
Abstract
We present Franca (pronounced Fran-ka): free one; the first fully open-source
(data, code, weights) vision foundation model that matches, and in many cases
surpasses, the performance of state-of-the-art proprietary models such as
DINOv2, CLIP, and SigLIPv2. Our approach is grounded in a transparent training
pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and
a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in
SSL clustering methods. While modern models rely on assigning image features to
large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to
account for the inherent ambiguity in clustering semantics. To address this, we
introduce a parameter-efficient, multi-head clustering projector based on
nested Matryoshka representations. This design progressively refines features
into increasingly fine-grained clusters without increasing the model size,
enabling both performance and memory efficiency. Additionally, we propose a
novel positional disentanglement strategy that explicitly removes positional
biases from dense representations, thereby improving the encoding of semantic
content. This leads to consistent gains on several downstream benchmarks,
demonstrating the utility of cleaner feature spaces. Our contributions
establish a new standard for transparent, high-performance vision models and
open a path toward more reproducible and generalizable foundation models for
the broader AI community. The code and model checkpoints are available at
https://github.com/valeoai/Franca.
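
The abstract refers to assigning image features to a large codebook via the Sinkhorn-Knopp algorithm. A minimal NumPy sketch of SwAV-style Sinkhorn-Knopp normalization follows; the shapes, temperature `eps`, and iteration count are illustrative assumptions, not Franca's actual settings:

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=50, eps=0.5):
    """Balanced soft assignment of B features to K clusters (illustrative).

    scores: (B, K) feature-to-prototype similarity logits.
    Returns Q of shape (B, K): each row sums to 1 (a soft assignment)
    and each column sums to ~B/K (clusters are used roughly equally).
    """
    Q = np.exp(scores / eps)               # temperature-scaled similarities
    Q /= Q.sum()                           # joint distribution over (B, K)
    B, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)  # balance cluster marginals
        Q /= K
        Q /= Q.sum(axis=1, keepdims=True)  # balance sample marginals
        Q /= B
    return Q * B                           # rows now sum to 1

rng = np.random.default_rng(0)
Q = sinkhorn_knopp(rng.normal(size=(8, 4)))
```

Methods in this family typically use such balanced assignments as soft pseudo-labels for a cross-entropy objective, which prevents the degenerate solution of mapping every feature to a single cluster.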
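
The nested Matryoshka design can be pictured as several clustering heads that read nested prefixes of the same embedding: a short prefix feeds a coarse head, longer prefixes feed increasingly fine-grained heads, so granularity grows without duplicating the backbone feature. The class name, prefix dimensions, and cluster counts below are hypothetical, not the paper's architecture:

```python
import numpy as np

class MatryoshkaClusteringHeads:
    """Sketch of a nested multi-head clustering projector.

    Head i sees only the first dims[i] feature dimensions (a Matryoshka
    prefix of the full embedding) and scores them against its own set of
    n_clusters[i] prototypes; coarser heads reuse a slice of the same
    embedding rather than a separate projection of it.
    """
    def __init__(self, dims=(64, 128, 256), n_clusters=(16, 64, 256), seed=0):
        rng = np.random.default_rng(seed)
        # one prototype matrix per head: (n_clusters[i], dims[i])
        self.prototypes = [rng.normal(size=(k, d)) / np.sqrt(d)
                           for d, k in zip(dims, n_clusters)]
        self.dims = dims

    def __call__(self, features):
        # features: (B, D) with D >= max(dims); each head uses a nested prefix
        return [features[:, :d] @ P.T
                for d, P in zip(self.dims, self.prototypes)]

heads = MatryoshkaClusteringHeads()
logits = heads(np.random.default_rng(1).normal(size=(4, 256)))
```

Each head's logits could then be turned into balanced assignments (e.g., via Sinkhorn-Knopp) at its own granularity, which is one way to read the "progressively refines features into increasingly fine-grained clusters" claim.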
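
One way to picture the positional disentanglement idea is to estimate the subspace of patch features that linearly predicts patch position and project it out, leaving features that carry semantic rather than positional information. This is only a hedged illustration of the general idea, not Franca's actual procedure:

```python
import numpy as np

def remove_positional_component(feats, positions):
    """Project out the feature subspace that linearly predicts patch position.

    feats: (N, D) patch features; positions: (N, 2) normalized (row, col)
    coordinates. Returns features of the same shape with the
    position-predictive directions removed. Illustrative sketch only.
    """
    # least-squares map W such that feats @ W ~ positions, W: (D, 2)
    W, *_ = np.linalg.lstsq(feats, positions, rcond=None)
    # orthonormal basis of the position-predictive subspace
    Qb, _ = np.linalg.qr(W)
    # subtract the projection of each feature onto that subspace
    return feats - (feats @ Qb) @ Qb.T

rng = np.random.default_rng(2)
feats = rng.normal(size=(196, 32))          # e.g., 14x14 patch grid
pos = rng.uniform(size=(196, 2))
cleaned = remove_positional_component(feats, pos)
```

After this projection, the original least-squares direction carries no signal (`cleaned @ W` is numerically zero), which is the intuition behind removing positional bias from dense representations.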