ModelLens: Finding the Best for Your Task from Myriads of Models

Abstract

The open-source model ecosystem now contains hundreds of thousands of pretrained models, yet picking the best model for a new dataset is increasingly infeasible: new models and unbenchmarked datasets emerge continuously, leaving practitioners with no prior records on either side. Existing approaches handle only fragments of this in-the-wild setting: AutoML and transferability estimation either select models from small predefined pools or require expensive per-model forward passes on the target dataset, while model routing presupposes a given candidate pool. We introduce ModelLens, a unified framework for model recommendation in the wild. Our key insight is that public leaderboard interactions, though scattered and noisy, collectively trace out an implicit atlas of model capabilities across heterogeneous evaluation settings, a signal rich enough to learn from directly. By learning a performance-aware latent space over model-dataset-metric tuples, ModelLens ranks unseen models on unseen datasets without running candidates on the target dataset. On a new benchmark of 1.62M evaluation records spanning 47K models and 9.6K datasets, ModelLens surpasses baselines that either rely on metadata alone or require running each candidate on the target dataset. Its recommended Top-K pools also improve multiple representative routing methods by up to 81% across diverse QA benchmarks. Case studies on recently released benchmarks confirm generalization to both text and vision-language tasks.
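To make the idea of ranking candidates in a shared latent space concrete, the following is a minimal sketch and not the ModelLens implementation. The abstract does not specify how embeddings are obtained or how scores are computed, so everything here is an assumption for illustration: the function names `score` and `recommend_top_k`, the three-dimensional toy vectors, and the trilinear scoring rule. The point it shows is that, once models, datasets, and metrics live in one space, a Top-K pool can be produced without running any candidate on the target dataset.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def score(model_vec: Vector, dataset_vec: Vector, metric_vec: Vector) -> float:
    """Hypothetical trilinear score: sum over dimensions of m_i * d_i * k_i.

    This scoring rule is an assumption chosen for simplicity; the abstract
    does not state the actual scoring function.
    """
    return sum(m * d * k for m, d, k in zip(model_vec, dataset_vec, metric_vec))


def recommend_top_k(
    candidates: Dict[str, Vector],
    dataset_vec: Vector,
    metric_vec: Vector,
    k: int,
) -> List[Tuple[str, float]]:
    """Rank candidate models by predicted score and return the k best.

    No candidate is executed on the target dataset; only embeddings are used.
    """
    scored = [
        (name, score(vec, dataset_vec, metric_vec))
        for name, vec in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (made up for illustration, not taken from the paper).
candidates = {
    "model_a": [0.9, 0.1, 0.0],
    "model_b": [0.2, 0.8, 0.1],
    "model_c": [0.1, 0.2, 0.9],
}
dataset_vec = [0.1, 1.0, 0.2]  # hypothetical embedding of an unseen dataset
metric_vec = [1.0, 1.0, 1.0]   # hypothetical embedding of the target metric

top2 = recommend_top_k(candidates, dataset_vec, metric_vec, k=2)
for name, predicted in top2:
    print(f"{name}: {predicted:.2f}")
```

In a pipeline of the kind the abstract describes, the returned pool would then be handed to a routing method as its candidate set.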
