Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning

Abstract

Monte Carlo Tree Search (MCTS) has recently emerged as a powerful technique for enhancing the reasoning capabilities of LLMs. Techniques such as SFT or DPO have enabled LLMs to distill high-quality behaviors from MCTS, improving their reasoning performance. However, existing distillation methods underutilize the rich trajectory information generated by MCTS, limiting the potential for improvements in LLM reasoning. In this paper, we propose AlphaLLM-CPL, a novel pairwise training framework that enables LLMs to self-improve through MCTS behavior distillation. AlphaLLM-CPL efficiently leverages MCTS trajectories via two key innovations. First, it constructs stepwise trajectory pairs from child nodes sharing the same parent in the search tree, providing step-level information for more effective MCTS behavior distillation. Second, it introduces curriculum preference learning, dynamically adjusting the training sequence of trajectory pairs in each offline training epoch to prioritize critical learning steps and mitigate overfitting. Experimental results on mathematical reasoning tasks demonstrate that AlphaLLM-CPL significantly outperforms previous MCTS behavior distillation methods, substantially boosting the reasoning capabilities of LLMs.
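The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how sibling steps are ranked or how pairs are prioritized, so the `value` field, the `min_gap` threshold, and the caller-supplied `score` function below are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, List, Tuple

# (shared prefix of steps, preferred step, rejected step, value gap)
StepPair = Tuple[Tuple[str, ...], str, str, float]


@dataclass
class Node:
    step: str                 # text of this reasoning step
    value: float = 0.0        # ASSUMPTION: a scalar value estimate from search
    children: List["Node"] = field(default_factory=list)


def sibling_pairs(root: Node, min_gap: float = 0.1) -> List[StepPair]:
    """Build stepwise preference pairs from children that share a parent."""
    pairs: List[StepPair] = []
    stack = [(root, [root.step])]
    while stack:
        node, prefix = stack.pop()
        for a, b in combinations(node.children, 2):
            gap = abs(a.value - b.value)
            if gap >= min_gap:  # ASSUMPTION: skip near-ties
                win, lose = (a, b) if a.value > b.value else (b, a)
                pairs.append((tuple(prefix), win.step, lose.step, gap))
        for child in node.children:
            stack.append((child, prefix + [child.step]))
    return pairs


def reorder_for_epoch(pairs: List[StepPair],
                      score: Callable[[StepPair], float]) -> List[StepPair]:
    """Re-sort pairs before an epoch; higher score trains first.

    ASSUMPTION: the scoring criterion is supplied by the caller, since the
    abstract does not define one.
    """
    return sorted(pairs, key=score, reverse=True)


# Tiny hand-built tree: two levels, one sibling pair at each level.
root = Node("Q", children=[
    Node("A", 0.9, children=[Node("C", 0.8), Node("D", 0.5)]),
    Node("B", 0.2),
])
pairs = sibling_pairs(root)
ordered = reorder_for_epoch(pairs, score=lambda p: p[3])  # largest gap first
```

In an actual training loop, `reorder_for_epoch` would be called once per offline epoch with a score recomputed from the current model, which is what would make the ordering dynamic; the value-gap score used here is only a placeholder.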
