ModelLens: Finding the Best for Your Task from Myriads of Models
May 8, 2026
Authors: Rui Cai, Weijie Jacky Mo, Xiaofei Wen, Qiyao Ma, Wenhui Zhu, Xiwen Chen, Muhao Chen, Zhe Zhao
cs.AI
Abstract
The open-source model ecosystem now contains hundreds of thousands of pretrained models, yet picking the best model for a new dataset is increasingly infeasible: new models and unbenchmarked datasets emerge continuously, leaving practitioners with no prior records on either side. Existing approaches handle only fragments of this in-the-wild setting: AutoML and transferability estimation select models from small predefined pools or require expensive per-model forward passes on the target dataset, while model routing presupposes a given candidate pool. We introduce ModelLens, a unified framework for model recommendation in the wild. Our key insight is that public leaderboard interactions, though scattered and noisy, collectively trace out an implicit atlas of model capabilities across heterogeneous evaluation settings, a signal rich enough to learn from directly. By learning a performance-aware latent space over model-dataset-metric tuples, ModelLens ranks unseen models on unseen datasets without running candidates on the target dataset. On a new benchmark of 1.62M evaluation records spanning 47K models and 9.6K datasets, ModelLens surpasses baselines that either rely on metadata alone or require running each candidate on the target dataset. Its recommended Top-K pools further improve multiple representative routing methods by up to 81% across diverse QA benchmarks. Case studies on recently released benchmarks further confirm generalization to both text and vision-language tasks.
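To make the core idea concrete, here is a minimal illustrative sketch, not the paper's actual architecture: observed leaderboard records are treated as (model, dataset, metric, score) triples, latent embeddings are fit to reconstruct the observed scores, and candidate models are then ranked on a dataset by predicted score alone, with no forward passes. All names, the toy data, and the simple bilinear-plus-bias scoring form are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # latent dimension (illustrative choice)

# Toy leaderboard records: (model_id, dataset_id, metric_id, score in [0, 1]).
records = [(0, 0, 0, 0.82), (0, 1, 0, 0.61),
           (1, 0, 0, 0.74), (1, 1, 0, 0.79),
           (2, 0, 0, 0.55), (2, 1, 0, 0.58)]
n_models, n_datasets, n_metrics = 3, 2, 1

M = rng.normal(0.0, 0.1, (n_models, D))     # model embeddings
Dv = rng.normal(0.0, 0.1, (n_datasets, D))  # dataset embeddings
b = np.zeros(n_metrics)                     # per-metric offset

def predict(m, d, g):
    """Predicted score for a (model, dataset, metric) triple."""
    return float(M[m] @ Dv[d] + b[g])

# Fit the latent space by SGD on squared error over the observed records.
lr = 0.1
for _ in range(2000):
    for m, d, g, y in records:
        err = predict(m, d, g) - y
        M[m], Dv[d] = M[m] - lr * err * Dv[d], Dv[d] - lr * err * M[m]
        b[g] -= lr * err

# Rank all models on dataset 1 by predicted score, without running any of them.
ranking = sorted(range(n_models), key=lambda m: -predict(m, 1, 0))
print(ranking)  # model 1 comes first: it has the highest predicted score here
```

The real system must also place *unseen* datasets and models into this space (e.g., from metadata), which a pure ID-embedding sketch like this cannot do; it only illustrates how a performance-aware latent space turns scattered evaluation records into a ranking signal.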