PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
May 7, 2026
Authors: Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Shuo Chen, Zhankui He, Noveen Sachdeva, Weili Wang, Ed H. Chi, Shivaram Venkataraman, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang
cs.AI
Abstract
Large language models have become drivers of evolutionary search, but most systems rely on a fixed, prompt-elicited policy to sample the next candidates. This limits adaptation in practical engineering and research tasks, where evaluations are expensive and progress depends on learning task-specific search dynamics. We introduce PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. PACEvolve++ decouples strategic search decisions from implementation: a trainable advisor generates, assesses, and selects hypotheses, while a stronger frontier model translates the selected hypotheses into executable candidates. To train the advisor under non-stationary feedback, we propose a phase-adaptive approach that adjusts its optimization strategy across the phases of the evolutionary process. Early in evolution, it uses group-relative feedback to learn broad search preferences; later, as reward gaps compress, it emphasizes best-of-k frontier contribution to support stable refinement. Across expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation, PACEvolve++ outperforms state-of-the-art evolutionary search frameworks paired with frontier models, achieving faster convergence and stabilizing test-time training during evolutionary search.
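The phase-adaptive training signal described above can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not the paper's implementation: it blends a group-relative advantage (reward minus group mean, scaled by the group standard deviation, as in group-relative policy optimization) with a best-of-k indicator that credits only the top-k candidates, weighted by how far the evolutionary budget has progressed. The function name, the linear blending schedule, and the `progress` parameter are assumptions for illustration.

```python
import statistics


def phase_adaptive_advantages(rewards, progress, k=2):
    """Hypothetical sketch of a phase-adaptive advantage signal.

    Early in evolution (progress near 0), advantages are group-relative:
    each reward minus the group mean, scaled by the group std, so the
    advisor learns broad search preferences. Late in evolution (progress
    near 1), advantages emphasize best-of-k contribution: only the top-k
    candidates in the group receive credit, supporting stable refinement
    as reward gaps compress. `progress` in [0, 1] is the fraction of the
    evolutionary budget consumed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    group_relative = [(r - mean) / std for r in rewards]

    # Indices of the k highest-reward candidates in the group.
    top_k = sorted(range(len(rewards)),
                   key=lambda i: rewards[i], reverse=True)[:k]
    best_of_k = [1.0 if i in top_k else 0.0 for i in range(len(rewards))]

    # Linearly shift weight from the group-relative signal to the
    # best-of-k signal as the evolutionary run progresses.
    return [(1.0 - progress) * g + progress * b
            for g, b in zip(group_relative, best_of_k)]
```

Under this sketch, at `progress=0` the advantages sum to zero (pure group-relative baseline), while at `progress=1` only the top-k candidates carry nonzero credit.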