RewardHarness: Self-Evolving Agentic Post-Training

May 9, 2026
Authors: Yuxuan Zhang, Penghui Du, Bo Li, Cong Wei, Junwen Miao, Huaisong Zhang, Songcheng Cai, Yubo Wang, Dongfu Jiang, Yuyu Zhang, Ping Nie, Wenhu Chen, Changqian Yu, Kelsey R. Allen
cs.AI

Abstract

Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference annotation and additional model training. This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. We present RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Instead of learning from large-scale annotations, RewardHarness aligns with human preferences by iteratively evolving a library of tools and skills from as few as 100 preference demonstrations. Given a source image, candidate edited images, and an editing instruction, an Orchestrator selects the most relevant subset of tools and skills from the maintained library, and a frozen Sub-Agent uses them to construct a reasoning chain that produces a preference judgment. By comparing predicted judgments with ground-truth preferences and analyzing successes and failures in the reasoning process, the Orchestrator automatically refines its library of tools and skills without additional human annotation. Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward signal for GRPO fine-tuning, RL-tuned models achieve 3.52 on ImgEdit-Bench. Project page: https://rewardharness.com.
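
The select, judge, compare, refine loop described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the RewardHarness code: candidate edits are stood in for by small dictionaries of precomputed properties, a "tool" is a plain scoring function keyed by a trigger word, the frozen Sub-Agent is a fixed function that sums tool scores, and the Orchestrator's refinement step is replaced by a greedy search over a fixed pool of proposed tools (the paper instead analyzes successes and failures in the reasoning process). The names `Demo`, `select`, `sub_agent`, `accuracy`, and `evolve` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Candidate = dict[str, float]          # toy stand-in for an edited image (hypothetical)
Tool = Callable[[Candidate], float]   # scores one candidate; higher is better


@dataclass
class Demo:
    instruction: str
    candidates: tuple[Candidate, Candidate]
    preferred: int                    # index of the human-preferred candidate


def select(library: dict[str, Tool], instruction: str) -> list[Tool]:
    """Orchestrator selection: keep tools whose trigger word occurs in the instruction."""
    return [tool for trigger, tool in library.items() if trigger in instruction]


def sub_agent(tools: list[Tool], demo: Demo) -> int:
    """Frozen judge: its logic never changes, only the tools it is handed."""
    scores = [sum(tool(c) for tool in tools) for c in demo.candidates]
    return scores.index(max(scores))  # ties resolve to the first candidate


def accuracy(library: dict[str, Tool], demos: list[Demo]) -> float:
    """Fraction of demonstrations where the predicted preference matches the human one."""
    hits = sum(sub_agent(select(library, d.instruction), d) == d.preferred for d in demos)
    return hits / len(demos)


def evolve(library: dict[str, Tool], proposals: dict[str, Tool], demos: list[Demo]) -> dict[str, Tool]:
    """Greedy stand-in for refinement: keep a proposed tool only if accuracy improves."""
    library = dict(library)
    for trigger, tool in proposals.items():
        trial = {**library, trigger: tool}
        if accuracy(trial, demos) > accuracy(library, demos):
            library = trial
    return library


# Two made-up preference demonstrations; in both, the second candidate is preferred.
demos = [
    Demo("remove the car", ({"residue": 0.8, "drift": 0.1}, {"residue": 0.1, "drift": 0.2}), 1),
    Demo("make the sky red", ({"residue": 0.0, "drift": 0.7}, {"residue": 0.0, "drift": 0.1}), 1),
]

# Made-up pool of proposed tools, keyed by trigger word.
proposals = {
    "remove": lambda c: -c["residue"],  # penalize leftover traces of the removed object
    "make": lambda c: -c["drift"],      # penalize unintended changes elsewhere in the image
}

evolved = evolve({}, proposals, demos)
print(accuracy({}, demos), accuracy(evolved, demos))
```

In this sketch the judge's behavior changes only because the library it draws on changes, which mirrors the abstract's framing of reward modeling as context evolution rather than weight optimization.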