Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

Abstract

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods rely primarily on semantic taxonomies or embeddings, and often fail to capture these latent demands because surface-level semantics are misaligned with actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings with a limited number of posterior model comparisons, closing the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster by a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands. It thereby jointly learns a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations show that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and that it is effective in downstream tasks such as query routing.
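As a rough illustration of how per-cluster Bradley-Terry profiles and mixture weights could combine into a query-specific comparison, consider the minimal sketch below. The abstract does not give the exact formulation, so the following are assumptions made only for illustration: each cluster holds one log-strength per model, a query's mixture weights come from a softmax over per-query logits, and the query-level win probability is the weighted average of the per-cluster Bradley-Terry probabilities. The function names, model names, and numbers are hypothetical.

```python
import math


def softmax(logits):
    """Turn unnormalized scores into mixture weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bt_win_prob(strength_a, strength_b):
    """Bradley-Terry probability that A beats B, given log-strengths."""
    return 1.0 / (1.0 + math.exp(-(strength_a - strength_b)))


def query_win_prob(mixture_logits, cluster_strengths, model_a, model_b):
    """Assumed form: mix the per-cluster Bradley-Terry probabilities
    using the query's mixture weights over clusters."""
    weights = softmax(mixture_logits)
    return sum(
        w * bt_win_prob(profile[model_a], profile[model_b])
        for w, profile in zip(weights, cluster_strengths)
    )


# Hypothetical capability profiles: one log-strength per model per cluster.
cluster_strengths = [
    {"model_x": 1.0, "model_y": 0.0},  # cluster 0 favors model_x
    {"model_x": 0.0, "model_y": 1.0},  # cluster 1 favors model_y
]

# A query split evenly between the two clusters.
p_mixed = query_win_prob([0.0, 0.0], cluster_strengths, "model_x", "model_y")

# A query that belongs almost entirely to cluster 0.
p_cluster0 = query_win_prob([20.0, 0.0], cluster_strengths, "model_x", "model_y")
```

In this sketch, a query dominated by one cluster inherits that cluster's ranking of models, while a query with mixed demands gets an interpolated comparison. In a trained system the log-strengths and mixture logits would be fitted to observed pairwise model comparisons rather than set by hand.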