Rethinking Reflection in Pre-Training

Abstract

A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier, during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.
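
The evaluation idea in the abstract (inject a deliberate error into a chain-of-thought, then check whether the model still reaches the correct answer) can be illustrated with a minimal Python sketch. The abstract does not say how errors are injected or how answers are scored, so everything here is an assumption made for illustration: the `Example` container, the helpers `corrupt_step`, `build_adversarial_prompt`, and `self_corrected`, the choice to corrupt an integer, and the "last number in the completion" scoring rule are all hypothetical and are not the authors' method.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Hypothetical container for one reasoning problem."""
    question: str
    steps: List[str]  # correct chain-of-thought steps
    answer: str       # correct final answer, as a string


def corrupt_step(step: str, offset: int = 1) -> str:
    """Introduce a deliberate error by shifting the last integer in the step."""
    matches = list(re.finditer(r"-?\d+", step))
    if not matches:
        raise ValueError("step contains no integer to corrupt")
    last = matches[-1]
    wrong = int(last.group()) + offset
    return step[: last.start()] + str(wrong) + step[last.end():]


def build_adversarial_prompt(ex: Example, bad_index: int) -> str:
    """Question followed by the steps up to and including one corrupted step."""
    steps = ex.steps[:bad_index] + [corrupt_step(ex.steps[bad_index])]
    return ex.question + "\n" + "\n".join(steps) + "\n"


def self_corrected(completion: str, ex: Example) -> bool:
    """Assumed scoring rule: the last integer in the completion is the answer."""
    numbers = re.findall(r"-?\d+", completion)
    return bool(numbers) and numbers[-1] == ex.answer
```

In use, a prompt built by `build_adversarial_prompt` would be passed to a pre-training checkpoint, and `self_corrected` applied to the continuation. Repeating this across checkpoints would give the kind of performance-over-training curve the abstract describes, under the assumptions stated above.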
