SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization

Abstract

Reinforcement learning with verifiable rewards has emerged as a powerful paradigm for training intelligent agents. However, existing methods typically employ binary rewards, which cannot capture quality differences among trajectories that reach the same outcome and therefore overlook the diversity within the solution space. Inspired by the "sweet spot" concept in tennis (the core region of the racket that produces the best hitting effect), we introduce Sweet Spot Learning (SSL), a novel framework that provides differentiated guidance for agent optimization. SSL follows a simple yet effective principle: progressively amplified, tiered rewards guide policies toward the sweet-spot region of the solution space. This principle adapts naturally across diverse tasks: visual perception tasks use distance-tiered modeling to reward proximity, while complex reasoning tasks reward incremental progress toward promising solutions. We show theoretically that SSL preserves the ordering of optimal solutions and improves the gradient signal-to-noise ratio, thereby fostering more directed optimization. Extensive experiments on GUI perception, short- and long-term planning, and complex reasoning tasks show consistent improvements over strong baselines on 12 benchmarks, with sample efficiency gains of up to 2.5× and effective cross-task transfer. Our work establishes SSL as a general principle for training capable and robust agents.
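To make the contrast between a binary reward and a distance-tiered reward concrete, the following is a minimal Python sketch for a GUI-grounding style task, where the agent predicts a click point and the target is a bounding box. It is an illustration under stated assumptions, not the paper's implementation: the `Box` type, the tier thresholds and reward values in `TIERS`, and the choice of a normalized Chebyshev distance to the box center are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned target region (hypothetical type, for illustration only)."""
    x0: float
    y0: float
    x1: float
    y1: float


def binary_reward(px: float, py: float, box: Box) -> float:
    """Baseline: 1.0 if the predicted point lies inside the box, else 0.0."""
    inside = box.x0 <= px <= box.x1 and box.y0 <= py <= box.y1
    return 1.0 if inside else 0.0


# Assumed tiers: (maximum normalized distance from the center, reward).
# Rewards grow as the point approaches the center; the values are made up.
TIERS = ((0.25, 1.0), (0.5, 0.7), (1.0, 0.4))


def tiered_reward(px: float, py: float, box: Box, tiers=TIERS) -> float:
    """Distance-tiered reward: every hit scores above every miss,
    and hits closer to the box center land in a higher tier."""
    if binary_reward(px, py, box) == 0.0:
        return 0.0
    half_w = (box.x1 - box.x0) / 2.0
    half_h = (box.y1 - box.y0) / 2.0
    if half_w <= 0.0 or half_h <= 0.0:
        raise ValueError("box must have positive width and height")
    cx = (box.x0 + box.x1) / 2.0
    cy = (box.y0 + box.y1) / 2.0
    # Normalized Chebyshev distance: 0 at the center, 1 on the box edge.
    d = max(abs(px - cx) / half_w, abs(py - cy) / half_h)
    for max_d, reward in tiers:
        if d <= max_d:
            return reward
    return tiers[-1][1]  # fallback for floating-point edge cases


target = Box(0.0, 0.0, 100.0, 100.0)
center_hit = tiered_reward(50.0, 50.0, target)   # innermost tier
edge_hit = tiered_reward(80.0, 50.0, target)     # outermost tier
miss = tiered_reward(150.0, 50.0, target)        # outside the box
```

Under the binary reward, `center_hit` and `edge_hit` would be indistinguishable (both 1.0). The tiered version separates them while still scoring any hit above any miss, which is the ordering property the abstract refers to.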
