

PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

June 8, 2023
作者: Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang
cs.AI

Abstract

Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable evaluation benchmark is essential. However, establishing such a benchmark is not a trivial task due to the challenges associated with evaluation accuracy and privacy protection. In response to these challenges, we introduce a judge large language model, named PandaLM, which is trained to distinguish the superior model given several LLMs. PandaLM's focus extends beyond just the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the reliability of PandaLM, we collect a diverse human-annotated test dataset, where all contexts are generated by humans and labels are aligned with human preferences. Our results indicate that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. PandaLM makes LLM evaluation fairer and less costly, as evidenced by the significant improvements achieved by models tuned through PandaLM compared to their counterparts trained with Alpaca's default hyperparameters. In addition, PandaLM does not depend on API-based evaluations, thus avoiding potential data leakage. All resources of PandaLM are released at https://github.com/WeOpenML/PandaLM.
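
The abstract describes PandaLM as a judge LLM that compares responses from several candidate models to the same instruction. Below is a minimal sketch of that pairwise-judging idea using Hugging Face `transformers`; it is not the official PandaLM API. The model id `WeOpenML/PandaLM-7B-v1` and the prompt template are assumptions for illustration only; see https://github.com/WeOpenML/PandaLM for the released weights and the exact evaluation format.

```python
# Minimal sketch (assumed, not the official PandaLM pipeline): load a judge LLM
# and ask it to compare two candidate responses to the same instruction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "WeOpenML/PandaLM-7B-v1"  # assumed checkpoint name; check the repo

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def judge(instruction: str, response_1: str, response_2: str) -> str:
    """Ask the judge model which response better follows the instruction."""
    # Illustrative prompt only; the official PandaLM template may differ.
    prompt = (
        "Below are two responses to the same instruction. "
        "Decide which response is better, or declare a tie, and explain briefly.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response 1:\n{response_1}\n\n"
        f"### Response 2:\n{response_2}\n\n"
        "### Evaluation:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Return only the newly generated evaluation text, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

print(judge(
    "Explain overfitting in one paragraph.",
    "Overfitting occurs when a model memorizes its training data ...",
    "Overfitting is when training loss keeps dropping while test loss rises ...",
))
```

Because the judge runs locally on released weights, comparisons of this kind avoid sending evaluation data to an external API, which is the data-leakage concern the abstract raises.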