Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

May 26, 2023
作者: Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot
cs.AI

Abstract

As large language models (LLMs) are continuously being developed, their evaluation becomes increasingly important yet challenging. This work proposes Chain-of-Thought Hub, an open-source evaluation suite for the multi-step reasoning capabilities of large language models. We are interested in this setting for two reasons: (1) from the behavior of the GPT and PaLM model families, we observe that complex reasoning is likely to be a key differentiator between weaker and stronger LLMs; (2) we envisage large language models becoming the next-generation computational platform, fostering an ecosystem of new LLM-based applications; this naturally requires the foundation models to perform complex tasks that often combine linguistic and logical operations. Our approach is to compile a suite of challenging reasoning benchmarks to track the progress of LLMs. Our current results show that: (1) model scale clearly correlates with reasoning capability; (2) as of May 2023, Claude-v1.3 and PaLM-2 are the only two models comparable with GPT-4, while open-source models still lag behind; (3) LLaMA-65B performs close to code-davinci-002, indicating that, with further development such as reinforcement learning from human feedback (RLHF), it has great potential to approach GPT-3.5-Turbo. Our results also suggest that, for open-source efforts to catch up, the community should focus more on building better base models and exploring RLHF.
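Benchmarks of this kind are typically scored with few-shot chain-of-thought prompting: the model is shown worked examples whose reasoning ends in an explicit final-answer line, and accuracy is computed by extracting that line from the completion. The following is a minimal sketch of such a harness, not the repository's actual code; the `complete` callable, the `extract_answer` helper, and the exemplar format are illustrative assumptions.

```python
# Minimal sketch of few-shot chain-of-thought evaluation.
# `complete` is a hypothetical stand-in for any LLM completion call
# (e.g., an API client wrapper); it is an assumption, not a real API.
import re
from typing import Callable, List, Tuple

# One few-shot exemplar: a question paired with a worked reasoning
# chain that ends in "The answer is <number>."
FEW_SHOT = """Q: Tom has 3 apples and buys 2 more. How many apples does he have?
A: Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5. The answer is 5.
"""

def extract_answer(text: str) -> str:
    """Pull the final number after 'The answer is'; empty string if absent."""
    m = re.search(r"The answer is\s*(-?[\d,\.]+)", text)
    return m.group(1).replace(",", "").rstrip(".") if m else ""

def evaluate(complete: Callable[[str], str],
             dataset: List[Tuple[str, str]]) -> float:
    """Score few-shot CoT accuracy over (question, gold_answer) pairs."""
    correct = 0
    for question, gold in dataset:
        prompt = FEW_SHOT + f"Q: {question}\nA:"
        correct += extract_answer(complete(prompt)) == gold
    return correct / len(dataset)

if __name__ == "__main__":
    # Toy 'model' that always emits the same chain, just to exercise
    # the harness end to end.
    dummy = lambda prompt: "3 + 2 = 5. The answer is 5."
    print(evaluate(dummy, [("Tom has 3 apples and buys 2 more. How many?", "5")]))
```

Answer extraction rather than exact-match on the whole completion is the key design choice: it lets the model produce a free-form reasoning chain while still being scored automatically on the final result.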