Understanding the Challenges in Iterative Generative Optimization with LLMs

Abstract

Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows, or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice it remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make "hidden" design choices: what can the optimizer edit, and what is the "right" learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and the batching of trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard (BBEH), we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.
