Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

Abstract

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.
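
As a rough illustration of the idea described above, the following sketch shows how per-cluster Bradley-Terry profiles and per-query mixture weights could combine into a query-specific model ranking. It is not the authors' implementation: the function names (`soft_assign`, `mixture_win_prob`, `rank_models`), the distance-based softmax used to derive mixture weights from embeddings, the temperature `tau`, and the toy numbers are all assumptions made for illustration. Training (fitting the profiles and weights from observed model comparisons) is omitted.

```python
import numpy as np

def soft_assign(embedding, centroids, tau=1.0):
    """Hypothetical prior mixture weights for one query.

    Assumption: weights are a softmax over negative squared distances between
    the query embedding (D,) and K cluster centroids (K, D).
    """
    logits = -np.sum((centroids - embedding) ** 2, axis=1) / tau
    logits = logits - logits.max()          # numerical stability
    w = np.exp(logits)
    return w / w.sum()

def mixture_win_prob(weights, profiles, i, j):
    """P(model i beats model j) for a query with the given mixture weights.

    profiles is a (K, M) array of Bradley-Terry log-strengths, one row per
    cluster. Within cluster k the Bradley-Terry win probability is
    sigmoid(profiles[k, i] - profiles[k, j]); the query-level probability is
    the weighted average over clusters.
    """
    per_cluster = 1.0 / (1.0 + np.exp(-(profiles[:, i] - profiles[:, j])))
    return float(weights @ per_cluster)

def rank_models(weights, profiles):
    """Rank models (best first) by mean win probability against all others."""
    m = profiles.shape[1]
    scores = [
        np.mean([mixture_win_prob(weights, profiles, i, j)
                 for j in range(m) if j != i])
        for i in range(m)
    ]
    return [int(k) for k in np.argsort(-np.asarray(scores))]

# Toy setup (made-up numbers): 2 clusters, 3 models.
# Cluster 0 favors model 0; cluster 1 favors model 2.
profiles = np.array([[1.0, 0.0, -1.0],
                     [-1.0, 0.0, 1.0]])
centroids = np.array([[0.0, 0.0],
                      [10.0, 10.0]])

w = soft_assign(np.array([0.0, 0.0]), centroids)  # query sits on centroid 0
ranking = rank_models(w, profiles)
```

In this toy setup, a query that sits on the first centroid inherits that cluster's profile almost entirely, so model 0 is ranked first. A query with evenly mixed weights yields a win probability of 0.5 between models 0 and 2, which shows how mixture weights blend conflicting capability profiles.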