Understanding the Challenges of Iterative Generative Optimization with Large Language Models

Abstract

Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows, or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet it remains brittle in practice: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make "hidden" design choices: what can the optimizer edit, and what is the "right" learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and the batching of trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard (BBEH), we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.
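To make the three design choices concrete, the following is a minimal sketch of a learning loop in which each choice appears as an explicit parameter. It is an illustration under stated assumptions, not the implementation used in the case studies: the names `LoopConfig`, `optimize`, `truncate`, `toy_run_trial`, and `toy_propose` are hypothetical, and the LLM call is replaced by a deterministic stub so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Trace = List[str]                    # one execution trace, as a list of step logs
Evidence = List[Tuple[Trace, float]]  # (possibly truncated trace, score) per trial


@dataclass
class LoopConfig:
    """Hypothetical container for the 'hidden' choices named in the abstract."""
    credit_horizon: int  # how many trailing trace steps the optimizer may see
    batch_size: int      # how many trials are pooled into one update
    num_updates: int     # how many times the artifact is rewritten


def truncate(trace: Trace, horizon: int) -> Trace:
    """Keep only the last `horizon` steps of a trace (none if horizon <= 0)."""
    return trace[-horizon:] if horizon > 0 else []


def optimize(
    starting_artifact: str,
    run_trial: Callable[[str, int], Tuple[Trace, float]],
    propose: Callable[[str, Evidence], str],
    cfg: LoopConfig,
) -> str:
    """Iteratively rewrite an artifact from batched, truncated execution feedback."""
    artifact = starting_artifact  # choice 1: where the search starts
    trial_id = 0
    for _ in range(cfg.num_updates):
        evidence: Evidence = []
        for _ in range(cfg.batch_size):  # choice 3: trials per update
            trace, score = run_trial(artifact, trial_id)
            # choice 2: how much of the trace counts as learning evidence
            evidence.append((truncate(trace, cfg.credit_horizon), score))
            trial_id += 1
        artifact = propose(artifact, evidence)
    return artifact


def toy_run_trial(artifact: str, trial_id: int) -> Tuple[Trace, float]:
    """Toy environment: five logged steps, score equal to artifact length."""
    trace = [f"trial {trial_id} step {i}" for i in range(5)]
    return trace, float(len(artifact))


def toy_propose(artifact: str, evidence: Evidence) -> str:
    """Stand-in for an LLM call: a real optimizer would read `evidence`."""
    return artifact + "!"


if __name__ == "__main__":
    cfg = LoopConfig(credit_horizon=2, batch_size=3, num_updates=4)
    print(optimize("seed", toy_run_trial, toy_propose, cfg))
```

In this sketch, changing `starting_artifact`, `credit_horizon`, or `batch_size` changes what the optimizer can reach and what it sees at each update, which is the sense in which these choices are usually left implicit.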