Orca 2: Teaching Small Language Models How to Reason
November 18, 2023
Authors: Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agrawal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, Hamid Palangi, Guoqing Zheng, Corby Rosset, Hamed Khanpour, Ahmed Awadallah
cs.AI
Abstract
Orca 1 learns from rich signals, such as explanation traces, allowing it to
outperform conventional instruction-tuned models on benchmarks like BigBench
Hard and AGIEval. In Orca 2, we continue exploring how improved training
signals can enhance smaller LMs' reasoning abilities. Research on training
small LMs has often relied on imitation learning to replicate the output of
more capable models. We contend that excessive emphasis on imitation may
restrict the potential of smaller models. We seek to teach small LMs to employ
different solution strategies for different tasks, potentially different from
the ones used by larger models. For example, while larger models might
provide a direct answer to a complex task, smaller models may not have the same
capacity. In Orca 2, we teach the model various reasoning techniques
(step-by-step, recall then generate, recall-reason-generate, direct answer,
etc.). More crucially, we aim to help the model learn to determine the most
effective solution strategy for each task. We evaluate Orca 2 using a
comprehensive set of 15 diverse benchmarks (corresponding to approximately 100
tasks and over 36,000 unique prompts). Orca 2 significantly surpasses models of
similar size and attains performance levels similar to or better than those of
models 5-10x larger, as assessed on complex tasks that test advanced reasoning
abilities in zero-shot settings. We open-source Orca 2 to encourage further
research on the development, evaluation, and alignment of smaller LMs.
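The reasoning techniques named in the abstract are realized as distinct solution strategies the model is trained to apply per task. A minimal sketch of the idea is below; the strategy names come from the abstract, but the prompt wording and the `build_prompt` helper are purely illustrative assumptions, not the paper's actual training prompts.

```python
# Hypothetical sketch: each reasoning strategy from the abstract mapped to
# an illustrative system instruction. The wording here is an assumption for
# demonstration only, not Orca 2's actual prompts.
STRATEGY_PROMPTS = {
    "direct": "Answer the question directly and concisely.",
    "step_by_step": "Work through the problem step by step before answering.",
    "recall_then_generate": "First recall the relevant facts, then generate the answer.",
    "recall_reason_generate": "Recall the relevant facts, reason over them, then answer.",
}

def build_prompt(strategy: str, question: str) -> str:
    """Prepend the system instruction for the chosen strategy to a question."""
    system = STRATEGY_PROMPTS[strategy]
    return f"{system}\n\nQuestion: {question}"

if __name__ == "__main__":
    print(build_prompt("step_by_step", "Which is heavier: a kilogram of iron or a kilogram of feathers?"))
```

The point the abstract makes is that the choice among these strategies is itself learned, so at inference time the model, not the user, decides which style of reasoning fits the task.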