Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

Abstract

As large language models (LLMs) continue to be developed, their evaluation becomes increasingly important yet challenging. This work proposes Chain-of-Thought Hub, an open-source evaluation suite for the multi-step reasoning capabilities of large language models. We are interested in this setting for two reasons: (1) from the behavior of the GPT and PaLM model families, we observe that complex reasoning is likely to be a key differentiator between weaker and stronger LLMs; (2) we envisage large language models becoming the next-generation computational platform and fostering an ecosystem of new LLM-based applications, which naturally requires the foundation models to perform complex tasks that often involve the composition of linguistic and logical operations. Our approach is to compile a suite of challenging reasoning benchmarks to track the progress of LLMs. Our current results show that: (1) model scale clearly correlates with reasoning capabilities; (2) as of May 2023, Claude-v1.3 and PaLM-2 are the only two models comparable with GPT-4, while open-source models still lag behind; (3) LLaMA-65B performs close to code-davinci-002, indicating that with successful further development such as reinforcement learning from human feedback (RLHF), it has great potential to approach GPT-3.5-Turbo. Our results also suggest that, for open-source efforts to catch up, the community may need to focus more on building better base models and exploring RLHF.

