Advancing LLM Reasoning Generalists with Preference Trees

Abstract

We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning in a comprehensive evaluation across 12 tests covering five tasks, and achieves 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, outperforming existing open-source models by margins of more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks than they are for general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.
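To make the preference-tree idea more concrete, the following is a minimal sketch of how pairwise data might be read off such a tree. It is an illustration under stated assumptions, not the UltraInteract schema: the `Node` class, its fields, and the `collect_pairs` helper are all hypothetical names, and the pairing rule (every correct response paired with every incorrect sibling at the same turn) is one plausible choice rather than the method described in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One candidate response at some turn (hypothetical structure)."""
    text: str
    correct: bool
    # Follow-up responses produced after this one, e.g. after feedback
    # from the environment and the critique.
    children: List["Node"] = field(default_factory=list)


def collect_pairs(instruction: str, roots: List[Node]) -> List[Tuple[str, str, str]]:
    """Return (context, chosen, rejected) triples from a preference tree.

    Assumption: at each turn, every correct response is paired with
    every incorrect sibling that shares the same context.
    """
    pairs = []
    stack = [(instruction, roots)]
    while stack:
        context, siblings = stack.pop()
        chosen = [n for n in siblings if n.correct]
        rejected = [n for n in siblings if not n.correct]
        for c in chosen:
            for r in rejected:
                pairs.append((context, c.text, r.text))
        # Descend into later turns, extending the context with the
        # response that led to them.
        for n in siblings:
            if n.children:
                stack.append((context + "\n" + n.text, n.children))
    return pairs


# Toy tree: the incorrect first attempt "B" is retried in a second turn.
example_tree = [
    Node("A", correct=True),
    Node("B", correct=False, children=[
        Node("C", correct=True),
        Node("D", correct=False),
    ]),
]
example_pairs = collect_pairs("Q", example_tree)
```

In this toy tree each turn contributes one pair, so a single instruction yields training signal at more than one depth of the interaction.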
