ModelLens: Finding the Best for Your Task from Myriads of Models
May 8, 2026
Authors: Rui Cai, Weijie Jacky Mo, Xiaofei Wen, Qiyao Ma, Wenhui Zhu, Xiwen Chen, Muhao Chen, Zhe Zhao
cs.AI
Abstract
The open-source model ecosystem now contains hundreds of thousands of pretrained models, yet picking the best model for a new dataset is increasingly infeasible: new models and unbenchmarked datasets emerge continuously, leaving practitioners with no prior records on either side. Existing approaches handle only fragments of this in-the-wild setting: AutoML and transferability estimation select models from small predefined pools or require expensive per-model forward passes on the target dataset, while model routing presupposes a given candidate pool. We introduce ModelLens, a unified framework for model recommendation in the wild. Our key insight is that public leaderboard interactions, though scattered and noisy, collectively trace out an implicit atlas of model capabilities across heterogeneous evaluation settings, a signal rich enough to learn from directly. By learning a performance-aware latent space over model–dataset–metric tuples, ModelLens ranks unseen models on unseen datasets without running candidates on the target dataset. On a new benchmark of 1.62M evaluation records spanning 47K models and 9.6K datasets, ModelLens surpasses baselines that either rely on metadata alone or require running each candidate on the target dataset. Its recommended Top-K pools further improve multiple representative routing methods by up to 81% across diverse QA benchmarks. Case studies on recently released benchmarks further confirm generalization to both text and vision-language tasks.
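To make the "performance-aware latent space over model–dataset–metric tuples" concrete, here is a minimal, illustrative sketch in the spirit of the abstract, not the paper's actual method: each model, dataset, and metric gets a learned embedding, a trilinear factorization predicts the score of a (model, dataset, metric) triple from noisy leaderboard-style records, and candidates are then ranked on a target dataset by predicted score alone. All names, dimensions, and the synthetic records are assumptions for illustration.

```python
import numpy as np

# Toy latent-space sketch (illustrative only; not ModelLens itself):
# embed models, datasets, and metrics, then predict the score of a
# (model, dataset, metric) triple as a trilinear interaction.
rng = np.random.default_rng(0)
n_models, n_datasets, n_metrics, dim = 5, 4, 2, 8

M = rng.normal(0, 0.1, (n_models, dim))    # model embeddings
D = rng.normal(0, 0.1, (n_datasets, dim))  # dataset embeddings
G = rng.normal(0, 0.1, (n_metrics, dim))   # metric embeddings

# Synthetic "leaderboard" records: (model_id, dataset_id, metric_id, score).
# In practice these would be the scattered public evaluation records.
records = [(m, d, g, rng.uniform())
           for m in range(n_models)
           for d in range(n_datasets)
           for g in range(n_metrics)]

def predict(m, d, g):
    # One possible factorization: elementwise product of model and
    # metric embeddings, dotted with the dataset embedding.
    return float((M[m] * G[g]) @ D[d])

# Plain SGD on squared error over the observed triples.
lr = 0.05
for _ in range(200):
    for m, d, g, y in records:
        err = predict(m, d, g) - y
        M[m] -= lr * err * (G[g] * D[d])
        D[d] -= lr * err * (M[m] * G[g])
        G[g] -= lr * err * (M[m] * D[d])

# Recommend: rank every candidate model on a target dataset and metric
# using predictions only, with no forward pass of any candidate.
target_d, target_g = 0, 0
ranking = sorted(range(n_models),
                 key=lambda m: -predict(m, target_d, target_g))
print(ranking)
```

The key property the sketch shares with the abstract's claim is that ranking at recommendation time requires only embedding lookups and inner products, never running a candidate model on the target data; handling genuinely unseen models or datasets would additionally require side information (e.g. metadata encoders), which is beyond this toy.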