
RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

January 24, 2025
Authors: Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin
cs.AI

Abstract
Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at https://github.com/tangzhy/RealCritic.
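The closed-loop methodology described above can be made concrete with a minimal sketch: instead of rating the critique text directly, the critique is fed back to produce a correction, and the benchmark scores whether that correction is right. The function below is an illustrative reconstruction, not RealCritic's actual API; `model` is any prompt-to-text callable, and the prompt templates and answer-matching rule are assumptions.

```python
def closed_loop_score(model, problem, initial_solution, reference_answer,
                      rounds=1):
    """Critique a solution, apply the correction, and score the result.

    model: callable taking a prompt string and returning generated text.
    rounds: number of critique-correct iterations (rounds > 1 approximates
            the paper's iterative-critique setting).
    Returns 1.0 if the final corrected solution contains the reference
    answer, else 0.0 -- the critique is judged only by the correction
    it produces (closed-loop), never by its own wording (open-loop).
    """
    solution = initial_solution
    for _ in range(rounds):
        critique = model(
            f"Critique this solution to '{problem}':\n{solution}"
        )
        solution = model(
            f"Problem: {problem}\nSolution: {solution}\n"
            f"Critique: {critique}\nWrite a corrected solution."
        )
    return 1.0 if reference_answer in solution else 0.0
```

Self-critique versus cross-critique then reduces to whether `model` is the same model that produced `initial_solution` or a different one; the scoring logic is unchanged.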

