Orca 2: Teaching Small Language Models How to Reason

Abstract

Orca 1 learns from rich signals, such as explanation traces, allowing it to outperform conventional instruction-tuned models on benchmarks like BigBench Hard and AGIEval. In Orca 2, we continue exploring how improved training signals can enhance smaller LMs' reasoning abilities. Research on training small LMs has often relied on imitation learning to replicate the output of more capable models. We contend that excessive emphasis on imitation may restrict the potential of smaller models. We seek to teach small LMs to employ different solution strategies for different tasks, potentially different from the one used by the larger model. For example, while larger models might provide a direct answer to a complex task, smaller models may not have the same capacity. In Orca 2, we teach the model various reasoning techniques (step-by-step, recall then generate, recall-reason-generate, direct answer, etc.). More crucially, we aim to help the model learn to determine the most effective solution strategy for each task. We evaluate Orca 2 using a comprehensive set of 15 diverse benchmarks (corresponding to approximately 100 tasks and over 36,000 unique prompts). Orca 2 significantly surpasses models of similar size and attains performance levels similar to or better than those of models 5-10x larger, as assessed on complex tasks that test advanced reasoning abilities in zero-shot settings. We open-source Orca 2 to encourage further research on the development, evaluation, and alignment of smaller LMs.
