Calibrating LLM-Based Evaluator

Abstract

Recent advances in the language modeling and emergent capabilities of large language models (LLMs) make them promising reference-free evaluators of natural language generation quality and a competent alternative to human evaluation. However, because these models are closed-source or demand substantial compute to host and tune, there has been little work on further calibrating an off-the-shelf LLM-based evaluator toward better human alignment. In this work, we propose AutoCalibrate, a multi-stage, gradient-free approach that automatically calibrates and aligns an LLM-based evaluator with human preference. Instead of modeling human preferences explicitly, we first encompass them implicitly within a set of human labels. The language model itself then drafts an initial set of scoring criteria, leveraging in-context learning on different few-shot examples. To calibrate this set of criteria further, we select the best performers and re-draft them with self-refinement. Our experiments on multiple text quality evaluation datasets show a significant improvement in correlation with expert evaluation through calibration. Our comprehensive qualitative analysis offers intuitions and observations on what makes scoring criteria effective.
