Bootstrapping Task Spaces for Self-Improvement

September 4, 2025
Authors: Minqi Jiang, Andrei Lupu, Yoram Bachrach
cs.AI

Abstract

Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference-time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference-time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single or many task instances, can produce policies exhibiting strong inference-time self-improvement on held-out task instances, and the ability to iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
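The loop described in the abstract — sampling single-step revisions, identifying the most informative intermediate histories, and recycling those as new self-iteration task instances — can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the task buffer, `solve_step`, `evaluate`, and the variance-based informativeness score are all assumptions made for the sake of the example; the paper's actual selection criterion and RL update are not specified in this abstract.

```python
import random
from dataclasses import dataclass, field

@dataclass
class TaskInstance:
    """A self-iteration task: a problem plus the partial solution history so far."""
    problem: str
    history: list = field(default_factory=list)  # prior solution attempts

def solve_step(task):
    """Placeholder for one single-step revision by the policy (e.g. an LLM call)."""
    return f"revision-{len(task.history)}-{random.random():.3f}"

def evaluate(problem, attempt):
    """Placeholder reward in [0, 1] (e.g. a verifier score or test pass rate)."""
    return random.random()

def informativeness(rewards):
    """Toy score: reward variance across sampled revisions.
    High variance means the policy sometimes improves and sometimes fails from
    this starting point, so continued training here is likely informative."""
    mean = sum(rewards) / len(rewards)
    return sum((r - mean) ** 2 for r in rewards) / len(rewards)

def exit_style_loop(seed_tasks, num_rounds=3, samples_per_task=4, top_k=2):
    """Grow a task buffer by recycling informative intermediate histories
    as new self-iteration task instances (ExIt-style autocurriculum sketch)."""
    buffer = list(seed_tasks)
    for _ in range(num_rounds):
        candidates = []
        for task in random.sample(buffer, min(len(buffer), 8)):
            attempts = [solve_step(task) for _ in range(samples_per_task)]
            rewards = [evaluate(task.problem, a) for a in attempts]
            # (A single-step RL update on `task` using `attempts`/`rewards` would go here.)
            score = informativeness(rewards)
            best = attempts[max(range(len(rewards)), key=rewards.__getitem__)]
            new_task = TaskInstance(task.problem, task.history + [best])
            candidates.append((score, new_task))
        # Keep only the most informative continuations as new task instances.
        candidates.sort(key=lambda c: c[0], reverse=True)
        buffer.extend(t for _, t in candidates[:top_k])
    return buffer

if __name__ == "__main__":
    tasks = exit_style_loop([TaskInstance("prove the inequality ...")])
    print(f"task buffer grew to {len(tasks)} instances")
```

In this toy setup, training only ever sees single-step revisions, yet the buffer accumulates deeper and deeper partial histories, which is the mechanism the abstract credits for inference-time multi-step self-improvement beyond the depths seen during training.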