Bootstrapping Task Spaces for Self-Improvement
September 4, 2025
Authors: Minqi Jiang, Andrei Lupu, Yoram Bachrach
cs.AI
Abstract
Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference-time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference-time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single or many task instances, can produce policies exhibiting strong inference-time self-improvement on held-out task instances, and the ability to iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
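The abstract describes ExIt only at a high level. The sketch below illustrates one plausible reading of the core loop: train on single-step revisions, and promote intermediate histories that yield strong learning signal into new self-iteration task instances. The task buffer, Rollout type, informativeness score, and policy interface (revise, update) are hypothetical placeholders for illustration, not the paper's implementation.

```python
# A minimal, illustrative sketch of an ExIt-style task-space bootstrapping loop.
# All names here (Rollout, informativeness, the policy interface) are
# hypothetical placeholders chosen for illustration, not the authors' API.
import random
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Rollout:
    history: Any      # the partial solution history after one revision step
    reward: float     # scalar feedback for that single-step revision


def informativeness(rewards: List[float]) -> float:
    # Hypothetical proxy for learning signal: reward variance across a group
    # of sampled single-step revisions from the same starting point.
    mean = sum(rewards) / len(rewards)
    return sum((r - mean) ** 2 for r in rewards) / len(rewards)


def exit_training_loop(policy, seed_tasks, num_updates, group_size=4, threshold=0.1):
    # The task buffer starts from one or many seed instances and grows as
    # informative intermediate histories are promoted to new task instances.
    buffer = list(seed_tasks)
    for _ in range(num_updates):
        task = random.choice(buffer)
        # Single-step self-improvement: sample a group of one-step revisions.
        rollouts = [policy.revise(task) for _ in range(group_size)]
        rewards = [r.reward for r in rollouts]
        policy.update(rollouts, rewards)  # e.g., a policy-gradient-style update
        # Bootstrap the task space: keep starting points that yield high signal.
        if informativeness(rewards) > threshold:
            best = max(rollouts, key=lambda r: r.reward)
            buffer.append(best.history)  # continue iterating from this history later
    return policy
```

At inference time, the trained policy can then be applied repeatedly to its own outputs, even for more steps than the average iteration depth seen during training, which is the self-improvement behavior the abstract reports.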