Advancing LLM Reasoning Generalists with Preference Trees

Abstract

We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning on a comprehensive benchmark of 12 tests covering five tasks, and achieves 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, outperforming existing open-source models by margins of more than 13.3%. The strong performance of Eurus can be attributed primarily to UltraInteract, our newly curated, large-scale, high-quality alignment dataset designed specifically for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks than they are for general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.
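
The preference tree described in the abstract can be pictured as a small data structure. The Python sketch below is illustrative only: the class names `Node` and `PreferenceTree`, their fields, and the helper `pairwise_data` are hypothetical and are not taken from UltraInteract's actual schema. It also assumes that each action is simply labeled correct or incorrect, and that pairs are formed between correct and incorrect siblings that share the same context.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One model action (a reasoning chain or solution attempt) at some turn."""
    action: str
    correct: bool
    # Feedback from the environment or critique, shown before the next turn
    # (hypothetical field, used only when the node has children).
    observation: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


@dataclass
class PreferenceTree:
    """An instruction plus the sibling actions sampled at the first turn."""
    instruction: str
    roots: List[Node] = field(default_factory=list)


def pairwise_data(tree: PreferenceTree) -> List[Tuple[str, str, str]]:
    """Collect (context, chosen, rejected) triples from sibling nodes."""
    pairs: List[Tuple[str, str, str]] = []

    def visit(siblings: List[Node], context: str) -> None:
        chosen = [n for n in siblings if n.correct]
        rejected = [n for n in siblings if not n.correct]
        # Every correct sibling is preferred over every incorrect sibling.
        for c in chosen:
            for r in rejected:
                pairs.append((context, c.action, r.action))
        # Descend into later turns, extending the context with the
        # earlier action and the feedback it received.
        for n in siblings:
            if n.children:
                next_context = "\n".join([context, n.action, n.observation or ""])
                visit(n.children, next_context)

    visit(tree.roots, tree.instruction)
    return pairs


# A toy two-turn tree: the incorrect first attempt receives a critique
# and is followed by a second pair of attempts.
tree = PreferenceTree(
    instruction="Compute 2 + 3 * 4.",
    roots=[
        Node(action="3 * 4 = 12, so 2 + 12 = 14.", correct=True),
        Node(
            action="2 + 3 = 5, so 5 * 4 = 20.",
            correct=False,
            observation="Critique: multiplication comes before addition.",
            children=[
                Node(action="3 * 4 = 12, then 2 + 12 = 14.", correct=True),
                Node(action="2 + 3 * 4 = 24.", correct=False),
            ],
        ),
    ],
)
pairs = pairwise_data(tree)
```

In this toy example each turn contributes one chosen/rejected pair, and the second-turn pair carries the first attempt and its critique in its context. How the real dataset samples actions, labels them, and selects pairs is not specified in the abstract, so those choices here are assumptions.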
