ModelLens: Finding the Best Model for a Task Among Myriad Models

Abstract

The open-source model ecosystem now contains hundreds of thousands of pretrained models, yet picking the best model for a new dataset is increasingly infeasible: new models and unbenchmarked datasets emerge continuously, leaving practitioners with no prior records on either side. Existing approaches handle only fragments of this in-the-wild setting. AutoML and transferability estimation either select models from small predefined pools or require expensive per-model forward passes on the target dataset, while model routing presupposes a given candidate pool. We introduce ModelLens, a unified framework for model recommendation in the wild. Our key insight is that public leaderboard interactions, though scattered and noisy, collectively trace out an implicit atlas of model capabilities across heterogeneous evaluation settings, a signal rich enough to learn from directly. By learning a performance-aware latent space over model-dataset-metric tuples, ModelLens ranks unseen models on unseen datasets without running candidates on the target dataset. On a new benchmark of 1.62M evaluation records spanning 47K models and 9.6K datasets, ModelLens surpasses baselines that either rely on metadata alone or require running each candidate on the target dataset. Its recommended Top-K pools also improve multiple representative routing methods by up to 81% across diverse QA benchmarks. Case studies on recently released benchmarks confirm that it generalizes to both text and vision-language tasks.
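To make the idea of ranking unseen models on unseen datasets from leaderboard records concrete, the following is a minimal sketch and not the ModelLens architecture, which the abstract does not specify. It assumes that every model and dataset can be described by a small numeric feature vector (for instance, derived from metadata), fits a bilinear scorer with a per-metric offset on toy leaderboard records, and then ranks candidates that never appeared in training. The function names, feature vectors, scores, and model names are all hypothetical.

```python
import random

def predict(W, bias, m_feat, d_feat, metric):
    """Bilinear score m_feat^T W d_feat plus a per-metric offset."""
    s = bias.get(metric, 0.0)
    for i, mi in enumerate(m_feat):
        for j, dj in enumerate(d_feat):
            s += mi * W[i][j] * dj
    return s

def fit(records, dim_m, dim_d, epochs=200, lr=0.05, seed=0):
    """Fit W and the metric offsets by SGD on squared error.

    Each record is (model_features, dataset_features, metric, score).
    """
    rng = random.Random(seed)
    W = [[0.0] * dim_d for _ in range(dim_m)]
    bias = {}
    records = list(records)
    for _ in range(epochs):
        rng.shuffle(records)
        for m_feat, d_feat, metric, y in records:
            err = predict(W, bias, m_feat, d_feat, metric) - y
            bias[metric] = bias.get(metric, 0.0) - lr * err
            for i, mi in enumerate(m_feat):
                for j, dj in enumerate(d_feat):
                    W[i][j] -= lr * err * mi * dj
    return W, bias

def recommend(W, bias, candidates, d_feat, metric, k):
    """Return the names of the k candidates with the highest predicted score."""
    ranked = sorted(
        candidates.items(),
        key=lambda kv: predict(W, bias, kv[1], d_feat, metric),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy leaderboard (hypothetical): scores equal the dot product of the features.
train_models = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
train_datasets = [[1.0, 0.0], [0.0, 1.0]]
records = [
    (m, d, "accuracy", sum(a * b for a, b in zip(m, d)))
    for m in train_models
    for d in train_datasets
]
W, bias = fit(records, dim_m=2, dim_d=2)

# Neither these models nor this dataset appear in the training records.
candidates = {
    "model-A": [0.9, 0.1],
    "model-B": [0.1, 0.9],
    "model-C": [0.5, 0.5],
}
new_dataset = [0.8, 0.2]
top2 = recommend(W, bias, candidates, new_dataset, "accuracy", k=2)
print(top2)
```

Because the scorer depends only on feature vectors, no candidate has to be run on the target dataset at recommendation time. The per-metric offset shifts all candidates equally, so it does not change the ranking within a single dataset and metric.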