RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

Abstract

With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective alternative to human evaluation for assessing text generation quality in a wide range of tasks. However, a reliability gap remains between LLM-as-a-Judge and human evaluation. One important reason is the lack of guided oracles in the evaluation process. Motivated by the role of references pervasively used in classic text evaluation, we introduce RevisEval, a novel text generation evaluation paradigm based on response-adapted references. RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response being evaluated. Specifically, RevisEval leverages the text revision capabilities of large language models (LLMs) to adaptively revise the response, then treats the revised text as the reference (the response-adapted reference) for the subsequent evaluation. Extensive experiments demonstrate that RevisEval outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, compared to traditional references, our response-adapted references further boost classical text metrics such as BLEU and BERTScore, which can then even rival LLM-as-a-Judge. A detailed analysis is also conducted to confirm RevisEval's effectiveness in bias reduction, the impact of inference cost, and reference relevance.
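
The revise-then-evaluate pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `revis_eval_score`, `toy_reviser`, and `unigram_f1` are hypothetical, the reviser is a string-replacement stub standing in for an LLM revision call, and unigram F1 is swapped in for BLEU or BERTScore only to keep the example dependency-free.

```python
from collections import Counter
from typing import Callable


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: a simple stand-in for a classical metric such as BLEU."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def revis_eval_score(
    instruction: str,
    response: str,
    revise: Callable[[str, str], str],
    metric: Callable[[str, str], float] = unigram_f1,
) -> float:
    """Score a response against a reference produced by revising that response."""
    # Step 1: revise the response to obtain a response-adapted reference.
    reference = revise(instruction, response)
    # Step 2: evaluate the original response against that reference.
    return metric(response, reference)


def toy_reviser(instruction: str, response: str) -> str:
    """Hypothetical stub for an LLM reviser: corrects one known factual error."""
    return response.replace("Lyon", "Paris")


question = "What is the capital of France?"
good = revis_eval_score(question, "The capital of France is Paris", toy_reviser)
bad = revis_eval_score(question, "The capital of France is Lyon", toy_reviser)
print(good, bad)
```

In this toy setting a response that needs no revision matches its reference exactly, while a response that the reviser has to correct scores lower. In practice `revise` would wrap an LLM call and `metric` could be BLEU, BERTScore, or an LLM judge that receives the reference.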
