Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

May 26, 2023
作者: Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot
cs.AI

Abstract

As large language models (LLMs) are continuously being developed, their evaluation becomes increasingly important yet challenging. This work proposes Chain-of-Thought Hub, an open-source evaluation suite for the multi-step reasoning capabilities of large language models. We are interested in this setting for two reasons: (1) from the behavior of the GPT and PaLM model families, we observe that complex reasoning is likely to be a key differentiator between weaker and stronger LLMs; (2) we envisage large language models becoming the next-generation computational platform, fostering an ecosystem of new LLM-based applications; this naturally requires the foundation models to perform complex tasks that often combine linguistic and logical operations. Our approach is to compile a suite of challenging reasoning benchmarks to track the progress of LLMs. Our current results show that: (1) model scale clearly correlates with reasoning capability; (2) as of May 2023, Claude-v1.3 and PaLM-2 are the only two models comparable with GPT-4, while open-source models still lag behind; (3) LLaMA-65B performs close to code-davinci-002, indicating that, with further development such as reinforcement learning from human feedback (RLHF), it has great potential to approach GPT-3.5-Turbo. Our results also suggest that, for open-source efforts to catch up, the community should focus more on building better base models and exploring RLHF.
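Benchmarks of this kind are typically scored with few-shot chain-of-thought prompting: the model is shown worked examples whose reasoning ends in an explicit final-answer line, and accuracy is computed by extracting that line from the completion. The following is a minimal sketch of such a harness, not the repository's actual code; the `complete` callable, the `extract_answer` helper, and the exemplar format are illustrative assumptions.

```python
# Minimal sketch of few-shot chain-of-thought evaluation.
# `complete` is a hypothetical stand-in for any LLM completion call
# (e.g., an API client wrapper); it is an assumption, not a real API.
import re
from typing import Callable, List, Tuple

# One few-shot exemplar: a question paired with a worked reasoning
# chain that ends in "The answer is <number>."
FEW_SHOT = """Q: Tom has 3 apples and buys 2 more. How many apples does he have?
A: Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5. The answer is 5.
"""

def extract_answer(text: str) -> str:
    """Pull the final number after 'The answer is'; empty string if absent."""
    m = re.search(r"The answer is\s*(-?[\d,\.]+)", text)
    return m.group(1).replace(",", "").rstrip(".") if m else ""

def evaluate(complete: Callable[[str], str],
             dataset: List[Tuple[str, str]]) -> float:
    """Score few-shot CoT accuracy over (question, gold_answer) pairs."""
    correct = 0
    for question, gold in dataset:
        prompt = FEW_SHOT + f"Q: {question}\nA:"
        correct += extract_answer(complete(prompt)) == gold
    return correct / len(dataset)

if __name__ == "__main__":
    # Toy 'model' that always emits the same chain, just to exercise
    # the harness end to end.
    dummy = lambda prompt: "3 + 2 = 5. The answer is 5."
    print(evaluate(dummy, [("Tom has 3 apples and buys 2 more. How many?", "5")]))
```

Answer extraction rather than exact-match on the whole completion is the key design choice: it lets the model produce a free-form reasoning chain while still being scored automatically on the final result.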