Orca 2: Teaching Small Language Models How to Reason
November 18, 2023
Authors: Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agrawal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, Hamid Palangi, Guoqing Zheng, Corby Rosset, Hamed Khanpour, Ahmed Awadallah
cs.AI
Abstract
Orca 1 learns from rich signals, such as explanation traces, allowing it to
outperform conventional instruction-tuned models on benchmarks like BigBench
Hard and AGIEval. In Orca 2, we continue exploring how improved training
signals can enhance smaller LMs' reasoning abilities. Research on training
small LMs has often relied on imitation learning to replicate the output of
more capable models. We contend that excessive emphasis on imitation may
restrict the potential of smaller models. We seek to teach small LMs to employ
different solution strategies for different tasks, potentially different from
the one used by the larger model. For example, while larger models might
provide a direct answer to a complex task, smaller models may not have the same
capacity. In Orca 2, we teach the model various reasoning techniques
(step-by-step, recall then generate, recall-reason-generate, direct answer,
etc.). More crucially, we aim to help the model learn to determine the most
effective solution strategy for each task. We evaluate Orca 2 using a
comprehensive set of 15 diverse benchmarks (corresponding to approximately 100
tasks and over 36,000 unique prompts). Orca 2 significantly surpasses models of
similar size and attains performance levels similar to or better than those of
models 5-10x larger, as assessed on complex tasks that test advanced reasoning
abilities in zero-shot settings. We open-source Orca 2 to encourage further
research on the development, evaluation, and alignment of smaller LMs.
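To make the strategy-selection idea concrete, here is a minimal sketch of how strategy-specific system instructions could be paired with tasks at prompt-construction time. The strategy names come from the abstract; the prompt wording, dictionary, and `build_messages` helper are invented for illustration and are not the paper's actual training setup.

```python
# Hypothetical sketch (not from the Orca 2 paper): pair each task with a
# system instruction that steers the model toward one solution strategy.
# Strategy names are from the abstract; prompt texts are illustrative only.

STRATEGY_PROMPTS = {
    "step-by-step": "Think through the problem step by step, then answer.",
    "recall-then-generate": "First recall relevant facts, then generate the answer.",
    "recall-reason-generate": "Recall relevant facts, reason over them, then answer.",
    "direct-answer": "Answer directly and concisely without showing your work.",
}

def build_messages(task: str, strategy: str) -> list[dict]:
    """Assemble a chat-style prompt that applies the chosen strategy."""
    if strategy not in STRATEGY_PROMPTS:
        raise ValueError(f"unknown strategy: {strategy}")
    return [
        {"role": "system", "content": STRATEGY_PROMPTS[strategy]},
        {"role": "user", "content": task},
    ]

msgs = build_messages("What is 17 * 23?", "step-by-step")
```

During training, the paper's key point is that the small model is shown which strategy suits each task, so at inference it can pick an effective strategy on its own rather than imitating a larger model's direct answers.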