Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

Abstract

PRECISE extends Prediction-Powered Inference (PPI) to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. The PPI estimator is provably unbiased regardless of the LLM judge's error profile. To make it applicable to hierarchical metrics such as Precision@K, where annotations are collected per document but the metric is computed per query, we reduce the output-space computation from O(2^|C|) to O(2^K), where K is the ranking cutoff. On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, the framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking, with a +407 bps change in daily sales.
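
To illustrate how a small human-labeled set can correct a large LLM-judged set, the sketch below implements the standard PPI estimator for a mean, not the PRECISE extension to Precision@K itself. It assumes each input value is a per-query metric score (for example a Precision@K value), that the same queries are scored by both humans and the LLM judge in the labeled set, and that the two sets are drawn independently from the same distribution. The function names `ppi_mean` and `classical_mean` and all numbers in the usage lines are hypothetical.

```python
import math
from statistics import fmean, variance


def classical_mean(human_scores):
    """Baseline: mean and standard error from human scores alone."""
    n = len(human_scores)
    if n < 2:
        raise ValueError("need at least two human scores")
    return fmean(human_scores), math.sqrt(variance(human_scores) / n)


def ppi_mean(human_scores, judge_on_labeled, judge_on_unlabeled):
    """Standard PPI mean estimate and its standard error.

    human_scores       : human-derived scores on the small labeled set
    judge_on_labeled   : LLM-judge scores on the same labeled items
    judge_on_unlabeled : LLM-judge scores on the large unlabeled set
    """
    if len(human_scores) != len(judge_on_labeled):
        raise ValueError("labeled human and judge scores must be paired")
    n, big_n = len(human_scores), len(judge_on_unlabeled)
    if n < 2 or big_n < 2:
        raise ValueError("need at least two items in each set")

    # Rectifier: the judge's average error, measured where humans also scored.
    residuals = [y - f for y, f in zip(human_scores, judge_on_labeled)]

    # Judge-only mean on the large set, shifted by the measured error.
    estimate = fmean(judge_on_unlabeled) + fmean(residuals)

    # The two terms come from independent samples, so their variances add.
    se = math.sqrt(variance(judge_on_unlabeled) / big_n
                   + variance(residuals) / n)
    return estimate, se


# Hypothetical toy data: the judge is optimistic on one of four labeled queries.
human = [0.50, 0.75, 0.25, 1.00]
judge_labeled = [0.50, 0.75, 0.50, 1.00]
judge_unlabeled = [0.75, 0.50, 1.00, 0.50, 0.75, 0.25, 1.00, 0.75]

print(classical_mean(human))
print(ppi_mean(human, judge_labeled, judge_unlabeled))
```

The PPI standard error falls below the human-only one when the judge's residuals vary less than the human scores do, which is the mechanism behind a reduction such as the 4.45 to 3.50 reported above. A poorly correlated judge enlarges the residual variance and can remove that advantage, but it does not bias the estimate.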