Self-Improving Language Models with Bidirectional Evolutionary Search

Abstract

Search has been proposed as an effective method for self-improving language models and agentic systems, both for generating post-training samples and at inference time. However, widely used methods such as best-of-N sampling and tree search have two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, which restricts exploration to regions where the model places substantial probability mass. To address these limitations, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories, generating candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides the forward search. As theoretical motivation, we show that candidates generated by expansion-only search are confined to a narrow entropy shell, whereas evolution operators can escape it, and that backward search can exponentially reduce the number of samples required to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES achieves consistent gains, and that on three open problem-solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at https://github.com/Embodied-Minds-Lab/BES.
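The claim that dense subgoal feedback combined with recombination can sharply reduce sample counts can be illustrated with a toy model. The sketch below is not the BES algorithm. It assumes a hypothetical task made of `k` steps that are independent, that the model gets each step right with probability `p`, and that each step can be checked on its own. All function names and parameters are illustrative assumptions, not part of the released code.

```python
import random


def rollout(rng, k, p):
    """One hypothetical model rollout: each of k steps is correct with probability p."""
    return [rng.random() < p for _ in range(k)]


def sparse_search_samples(rng, k, p, max_samples=10**6):
    """Sparse signal (best-of-N style): a rollout counts only if all k steps are correct.

    Returns the number of rollouts drawn until the first fully correct one.
    """
    for n in range(1, max_samples + 1):
        if all(rollout(rng, k, p)):
            return n
    return max_samples


def recombining_search_samples(rng, k, p, max_samples=10**6):
    """Dense signal plus recombination (toy assumption, not the paper's operators).

    Each step is checked separately, and correct steps from different
    rollouts are spliced into a single candidate. Returns the number of
    rollouts drawn until every step has been solved at least once.
    """
    solved = [False] * k
    for n in range(1, max_samples + 1):
        steps = rollout(rng, k, p)
        solved = [done or ok for done, ok in zip(solved, steps)]
        if all(solved):
            return n
    return max_samples


def mean_samples(search, k, p, trials=200, seed=0):
    """Average sample count of a search routine over repeated trials."""
    rng = random.Random(seed)
    return sum(search(rng, k, p) for _ in range(trials)) / trials


if __name__ == "__main__":
    k, p = 5, 0.3
    print("sparse, whole-answer check:", mean_samples(sparse_search_samples, k, p))
    print("dense checks + recombination:", mean_samples(recombining_search_samples, k, p))
```

Under these assumptions the sparse search needs about `p**-k` rollouts in expectation, which grows exponentially in `k`, while the recombining search only has to wait until each step has succeeded once. Real tasks rarely decompose into independent, individually checkable steps, so the gap shown here is best read as an upper bound on the intuition rather than a prediction of BES's behavior.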