Improving Large Language Model Fine-tuning for Solving Math Problems

Abstract

Despite the success of large language models (LLMs) in many natural language tasks, solving math problems remains a significant challenge for them. A large gap exists between LLMs' pass-at-one and pass-at-N performance on math problems, which suggests that LLMs may be close to finding correct solutions and motivates our exploration of fine-tuning methods to unlock their performance. Using the challenging MATH dataset, we investigate three fine-tuning strategies: (1) solution fine-tuning, where we fine-tune the model to generate a detailed solution for a given math problem; (2) solution-cluster re-ranking, where the LLM is fine-tuned as a solution verifier/evaluator to choose among generated candidate solution clusters; (3) multi-task sequential fine-tuning, which efficiently integrates the solution generation and evaluation tasks to enhance LLM performance. With these methods, we present a thorough empirical study on a series of PaLM 2 models and find: (1) the quality and style of the step-by-step solutions used for fine-tuning can have a significant impact on model performance; (2) while solution re-ranking and majority voting are each effective at improving model performance when used separately, they can also be used together for an even greater performance boost; (3) multi-task fine-tuning that sequentially separates the solution generation and evaluation tasks can offer improved performance compared with the solution fine-tuning baseline. Guided by these insights, we design a fine-tuning recipe that yields approximately 58.8% accuracy on the MATH dataset with fine-tuned PaLM 2-L models, an 11.2% accuracy improvement over the few-shot performance of the pre-trained PaLM 2-L model with majority voting.
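The interplay between pass-at-N, majority voting, and re-ranking can be made concrete with a small sketch. It is illustrative only and is not the paper's implementation. It assumes that a "cluster" is the set of sampled solutions sharing the same final answer, and that each solution carries an evaluator score in [0, 1]. The function names, the `top_k` parameter, and the mean-score rule for combining voting with re-ranking are all hypothetical choices made for this example.

```python
from collections import defaultdict
from statistics import mean

# Each candidate is (final_answer, evaluator_score). Both the answers and
# the scores below are made-up illustration data, not results from the paper.
Candidate = tuple[str, float]


def cluster_by_answer(candidates: list[Candidate]) -> dict[str, list[float]]:
    """Group evaluator scores by final answer (assumed notion of a cluster)."""
    clusters: dict[str, list[float]] = defaultdict(list)
    for answer, score in candidates:
        clusters[answer].append(score)
    return dict(clusters)


def pass_at_n(candidates: list[Candidate], reference: str) -> bool:
    """True if any of the N sampled solutions reaches the reference answer."""
    return any(answer == reference for answer, _ in candidates)


def majority_vote(candidates: list[Candidate]) -> str:
    """Pick the answer backed by the largest cluster."""
    clusters = cluster_by_answer(candidates)
    return max(clusters, key=lambda a: len(clusters[a]))


def rerank_top_clusters(candidates: list[Candidate], top_k: int = 2) -> str:
    """One possible combination (an assumption, not the paper's recipe):
    keep the top_k largest clusters, then pick the one with the highest
    mean evaluator score."""
    clusters = cluster_by_answer(candidates)
    largest = sorted(clusters, key=lambda a: len(clusters[a]), reverse=True)[:top_k]
    return max(largest, key=lambda a: mean(clusters[a]))


candidates = [
    ("42", 0.2), ("42", 0.3), ("42", 0.1),  # frequent but poorly scored
    ("41", 0.9), ("41", 0.8),               # less frequent, well scored
    ("7", 0.95),                            # a single outlier
]
print(pass_at_n(candidates, "41"))                # True
print(majority_vote(candidates))                  # 42
print(rerank_top_clusters(candidates, top_k=2))   # 41
```

In this toy data the reference answer "41" is among the samples, so pass-at-N succeeds, yet plain majority voting returns "42". Restricting the evaluator to the two largest clusters filters out the lone outlier "7" and lets the scores pick "41", which is the kind of complementary effect the abstract describes.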
