Reverse-Engineered Reasoning for Open-Ended Generation

Abstract

While the "deep reasoning" paradigm has spurred significant advances in verifiable domains like mathematics, its application to open-ended, creative generation remains a critical challenge. The two dominant methods for instilling reasoning, reinforcement learning (RL) and instruction distillation, both falter in this area: RL struggles with the absence of clear reward signals and high-quality reward models, while distillation is prohibitively expensive and capped by the teacher model's capabilities. To overcome these limitations, we introduce REverse-Engineered Reasoning (REER), a new paradigm that fundamentally shifts the approach. Instead of building a reasoning process "forwards" through trial-and-error or imitation, REER works "backwards" from known-good solutions to computationally discover the latent, step-by-step deep reasoning process that could have produced them. Using this scalable, gradient-free approach, we curate and open-source DeepWriting-20K, a large-scale dataset of 20,000 deep reasoning trajectories for open-ended tasks. Our model, DeepWriter-8B, trained on this data, not only surpasses strong open-source baselines but also achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5.
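
The idea of working backwards from a known-good solution can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the search procedure or the scoring function, so the hill-climbing loop, the word-overlap `score` (a crude stand-in for a model-based measure of how well a reasoning trace explains the solution), and every name below are assumptions made purely for illustration.

```python
import random


def score(trace, solution):
    """Hypothetical stand-in score: fraction of solution words that the trace mentions."""
    solution_words = set(solution.lower().split())
    if not solution_words:
        return 0.0
    trace_words = set(" ".join(trace).lower().split())
    return len(solution_words & trace_words) / len(solution_words)


def reverse_engineer(solution, candidate_steps, n_steps=3, iters=200, seed=0):
    """Gradient-free local search over reasoning traces.

    Starts from a random trace and repeatedly proposes replacing one step,
    keeping the proposal only if it strictly improves the score against the
    known-good solution. No gradients are computed at any point.
    """
    rng = random.Random(seed)
    trace = [rng.choice(candidate_steps) for _ in range(n_steps)]
    best = score(trace, solution)
    for _ in range(iters):
        i = rng.randrange(n_steps)
        proposal = trace.copy()
        proposal[i] = rng.choice(candidate_steps)
        s = score(proposal, solution)
        if s > best:
            trace, best = proposal, s
    return trace, best


# Illustrative data only; not drawn from DeepWriting-20K.
solution = "the lighthouse keeper forgives the sea"
steps = [
    "open with the lighthouse at dusk",
    "the keeper remembers the storm",
    "decide that the keeper forgives",
    "end on the sea at dawn",
    "describe an unrelated city market",
]
trace, best = reverse_engineer(solution, steps)
print(trace, best)
```

In a realistic setting the overlap score would be replaced by a model-based quantity and the candidate steps would be generated rather than enumerated; the sketch only shows the shape of a search that is anchored on the final answer instead of on a reward signal.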
