Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

Abstract

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware evaluation of large language models (LLMs). Existing clustering methods rely primarily on semantic taxonomies or embeddings and often fail to capture these latent demands, because surface-level semantics are misaligned with actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using a limited number of posterior model comparisons, bridging the gap between surface-level semantics and latent capability demands. ECC characterizes each cluster by a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands. It thereby jointly learns a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations show that ECC significantly improves the quality of LLM capability rankings, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and that it is effective in downstream tasks such as query routing.
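To make the idea of per-cluster Bradley-Terry profiles combined through mixture weights concrete, the following is a minimal sketch and not the ECC implementation. The abstract does not give the exact parameterization, so everything here is an assumption for illustration: the function names, the softmax form of the mixture weights, the model names, and the strength values are all hypothetical. The sketch assumes that the probability of one model beating another on a query is the weighted average of the Bradley-Terry win probabilities of the individual clusters.

```python
import math


def softmax(logits):
    """Turn unnormalized scores into mixture weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bt_win_prob(strength_a, strength_b):
    """Bradley-Terry probability that A beats B, given log-strengths."""
    return 1.0 / (1.0 + math.exp(-(strength_a - strength_b)))


def mixture_win_prob(weights, cluster_strengths, model_a, model_b):
    """Assumed combination rule: weight each cluster's Bradley-Terry
    win probability by the query's mixture weight for that cluster."""
    return sum(
        w * bt_win_prob(s[model_a], s[model_b])
        for w, s in zip(weights, cluster_strengths)
    )


# Hypothetical capability profiles: one log-strength per model per cluster.
cluster_strengths = [
    {"model_x": 1.0, "model_y": 0.0},  # cluster 0 favours model_x
    {"model_x": 0.0, "model_y": 1.0},  # cluster 1 favours model_y
]

# A query that draws on both clusters equally (hypothetical logits).
mixed_weights = softmax([0.0, 0.0])
p_mixed = mixture_win_prob(mixed_weights, cluster_strengths, "model_x", "model_y")

# A query that belongs entirely to cluster 0.
p_pure = mixture_win_prob([1.0, 0.0], cluster_strengths, "model_x", "model_y")
```

Under these assumed values, the evenly mixed query gives neither model an advantage because the two cluster profiles cancel out, whereas the query assigned wholly to cluster 0 favours model_x. In a trained system the mixture weights and strengths would be fitted to observed model comparisons rather than set by hand.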