Language Models can Self-Improve at State-Value Estimation for Better Search

Abstract

Collecting ground truth task completion rewards or human demonstrations for multi-step reasoning tasks is often cost-prohibitive and time-consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead, a self-supervised method that leverages state-transition dynamics to train a value model capable of effectively guiding language model-controlled search. We find that moderately sized (8 billion parameters) open-weight value models improved with self-taught lookahead can match the performance of using a frontier LLM such as gpt-4o as the value model. Furthermore, we find that self-taught lookahead improves performance by 20% while reducing costs by a factor of 37 compared to previous LLM-based tree search, without relying on ground truth rewards.
