When Gradients Conflict: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

Abstract

Customizing an LLM judge to a specific task or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion, but they produce natural-language critiques rather than numerical vectors. As a result, the conflict-resolution toolkit of multi-task learning (PCGrad, MGDA) does not apply to the multi-objective textual gradient setting. We test five decomposition modes of textual gradient optimizers by varying how much cross-task information the loss, gradient, and optimizer LLMs share. In 6 of 10 configurations, we observe that optimization never improves over the initial prompt. Gradient specificity drops by 59% (from 9.0 to 3.7) when the gradient LLM processes multiple criteria jointly. Separately, we observe that naively combining per-task instructions into a single prompt degrades Spearman's rho by 5.3%. These results identify two separable failure modes, optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge customization using textual feedback.
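
To illustrate why PCGrad-style conflict resolution presupposes numerical gradients, the following is a minimal sketch of the PCGrad projection step for two tasks. It is not code from this work: the function names and the toy two-dimensional gradients are hypothetical, chosen only for illustration. The step relies on inner products between gradient vectors, an operation with no direct counterpart for natural-language critiques.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pcgrad_project(g_i: Sequence[float], g_j: Sequence[float]) -> List[float]:
    """Return g_i with its conflicting component along g_j removed.

    If the two gradients do not conflict (non-negative inner product),
    g_i is returned unchanged. Otherwise g_i is projected onto the
    normal plane of g_j: g_i - (g_i . g_j / ||g_j||^2) * g_j.
    """
    inner = dot(g_i, g_j)
    if inner >= 0:
        return list(g_i)
    scale = inner / dot(g_j, g_j)
    return [x - scale * y for x, y in zip(g_i, g_j)]


# Hypothetical toy gradients for two evaluation criteria (illustrative only).
g_a = [1.0, 0.0]
g_b = [-1.0, 1.0]

projected = pcgrad_project(g_a, g_b)      # conflicting pair: g_a is modified
unchanged = pcgrad_project(g_a, [1.0, 1.0])  # non-conflicting pair: g_a is kept
```

After projection, the adjusted gradient is orthogonal to the gradient it conflicted with, so a step along it no longer works against the other criterion to first order. A textual critique offers no inner product or norm, which is why this machinery does not carry over directly.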