Capturing LLM Capabilities through Evidence-Calibrated Query Clustering

Abstract

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods rely primarily on semantic taxonomies or embeddings, and they often fail to capture these latent capability demands because surface-level semantics do not align with actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using a limited number of posterior model comparisons, bridging the gap between surface-level semantics and latent capability demands. ECC characterizes each cluster by a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands. In doing so, it jointly learns a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations show that ECC significantly improves the quality of LLM capability rankings, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and that it is also effective in downstream tasks such as query routing.
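
To make the per-cluster Bradley-Terry profiles and mixture weights concrete, the following is a minimal sketch of how a query-specific win probability could be computed. It is an illustration under stated assumptions, not the ECC implementation. The function names, the centroid-distance form of the mixture weights, the temperature parameter, and all numeric values are hypothetical. The abstract only states that the mixture weights are trainable, and the training and calibration steps are not shown here.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mixture_weights(query_emb, centroids, temperature=1.0):
    """Soft cluster membership for one query.

    Assumption: weights come from negative squared distances between the
    query embedding and K cluster centroids. This is one plausible choice,
    used here only for illustration.
    """
    d2 = ((query_emb[None, :] - centroids) ** 2).sum(axis=1)
    return softmax(-d2 / temperature)


def win_probability(weights, theta, i, j):
    """P(model i beats model j on this query) under a Bradley-Terry mixture.

    theta has shape (K, M): one strength score per cluster and model.
    Each cluster contributes sigmoid(theta[k, i] - theta[k, j]), and the
    contributions are averaged with the query's mixture weights.
    """
    per_cluster = 1.0 / (1.0 + np.exp(-(theta[:, i] - theta[:, j])))
    return float(weights @ per_cluster)


# Toy setup: 2 clusters, 2 models, 2-dimensional embeddings (all made up).
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])
theta = np.array([[2.0, 0.0],   # cluster 0 favours model 0
                  [0.0, 2.0]])  # cluster 1 favours model 1

near_cluster0 = np.array([0.1, -0.1])
mixed_query = np.array([2.0, 2.0])  # equidistant from both centroids

w_near = mixture_weights(near_cluster0, centroids)
w_mixed = mixture_weights(mixed_query, centroids)
p_near = win_probability(w_near, theta, 0, 1)
p_mixed = win_probability(w_mixed, theta, 0, 1)
```

In this toy setup, a query close to the first centroid inherits that cluster's preference for model 0, while a query equidistant from both centroids yields an even win probability, which is the behaviour the mixed-capability weighting is meant to allow.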