CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
October 21, 2024
Authors: Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen
cs.AI
Abstract
Efficient and accurate evaluation is crucial for the continuous improvement
of large language models (LLMs). Among various assessment methods, subjective
evaluation has garnered significant attention due to its superior alignment
with real-world usage scenarios and human preferences. However, human-based
evaluations are costly and lack reproducibility, making precise automated
evaluators (judgers) vital in this process. In this report, we introduce
CompassJudger-1, the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable
versatility. It is capable of: 1. Performing unitary scoring and two-model
comparisons as a reward model; 2. Conducting evaluations according to specified
formats; 3. Generating critiques; 4. Executing diverse tasks like a general
LLM. To assess the evaluation capabilities of different judge models under a
unified setting, we have also established JudgerBench, a new benchmark
that encompasses various subjective evaluation tasks and covers a wide range of
topics. CompassJudger-1 offers a comprehensive solution for various evaluation
tasks while maintaining the flexibility to adapt to diverse requirements. Both
CompassJudger and JudgerBench are released and available to the research
community at https://github.com/open-compass/CompassJudger. We believe that by
open-sourcing these tools, we can foster collaboration and accelerate progress
in LLM evaluation methodologies.
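
As a rough illustration of the two-model comparison setting described above, the sketch below shows one way a judge's pairwise verdicts could be parsed and compared against human preference labels. Everything here is an assumption made for illustration: the prompt wording, the `[[A]]`/`[[B]]` verdict format, and the function names are hypothetical and are not taken from CompassJudger-1 or JudgerBench.

```python
import re
from typing import List, Optional

# Hypothetical verdict format (an assumption, not the paper's): the judge is
# asked to finish its critique with "[[A]]" or "[[B]]".
VERDICT_PATTERN = re.compile(r"\[\[([AB])\]\]")


def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble an illustrative prompt asking a judge to compare two answers."""
    return (
        "Compare the two answers to the question and explain your reasoning.\n"
        "Finish with [[A]] if Answer A is better or [[B]] if Answer B is better.\n\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n"
    )


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return 'A' or 'B' from the last verdict marker, or None if none is found."""
    matches = VERDICT_PATTERN.findall(judge_output)
    return matches[-1] if matches else None


def agreement_rate(judge_outputs: List[str], human_labels: List[str]) -> float:
    """Fraction of samples where the parsed verdict equals the human label.

    Outputs with no parsable verdict count as disagreements.
    """
    if len(judge_outputs) != len(human_labels):
        raise ValueError("judge_outputs and human_labels must have equal length")
    if not human_labels:
        return 0.0
    correct = sum(
        parse_verdict(output) == label
        for output, label in zip(judge_outputs, human_labels)
    )
    return correct / len(human_labels)
```

Taking the last marker in the output is a deliberate choice in this sketch: a critique may mention both options while reasoning, and the final marker is the most likely place for the verdict.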