Spend Search Where It Pays: Value-Guided Structured Sampling and Optimization for Generative Recommendation
February 11, 2026
Authors: Jie Jiang, Yangru Huang, Zeyu Wang, Changping Wang, Yuling Xiong, Jun Zhang, Huan Yu
cs.AI
Abstract
Generative recommendation via autoregressive models has unified retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, we develop Value-Guided Efficient Decoding (VED), which identifies decisive nodes and selectively deepens high-potential prefixes, improving exploration efficiency without exhaustive tree search. Second, we propose Sibling-GRPO, which exploits the induced tree topology to compute sibling-relative advantages and concentrates learning signals on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
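The two components described above can be illustrated with a minimal sketch. This is an assumption-laden toy, based only on the abstract: the `Node` class, the heuristic `value` stand-in for a learned value model, the toy terminal rewards, and all function names are hypothetical, not the paper's actual VED or Sibling-GRPO implementation. The sketch shows the shape of the idea: (1) deepen only the top-scoring continuations at each step rather than searching exhaustively, and (2) normalize each child's reward against its siblings so the advantage isolates the local branching decision.

```python
import math
import random

class Node:
    """One prefix node in the sampled decoding tree (hypothetical structure)."""
    def __init__(self, token, parent=None):
        self.token = token
        self.parent = parent
        self.children = []
        self.reward = None  # set only at leaves (complete trajectories)

def value(prefix):
    # Stand-in for a learned value model V(prefix); here a toy heuristic.
    return sum(prefix) / (len(prefix) + 1)

def expand(node, prefix, depth, max_depth, vocab, branch, rng):
    """Value-guided decoding sketch: score candidate continuations with the
    value model and selectively deepen only the top-`branch` prefixes,
    instead of exhaustively expanding the whole tree."""
    if depth == max_depth:
        node.reward = rng.random() + value(prefix)  # toy terminal reward
        return
    scored = sorted(vocab, key=lambda t: value(prefix + [t]), reverse=True)
    for tok in scored[:branch]:  # keep only high-potential branches
        child = Node(tok, parent=node)
        node.children.append(child)
        expand(child, prefix + [tok], depth + 1, max_depth, vocab, branch, rng)

def leaf_rewards(node):
    """Collect terminal rewards of all trajectories passing through `node`."""
    if not node.children:
        return [node.reward]
    return [r for c in node.children for r in leaf_rewards(c)]

def sibling_advantages(node, out):
    """Sibling-relative advantage sketch: standardize each child's mean leaf
    reward against its siblings, so the signal reflects the local branching
    decision rather than the full trajectory group."""
    if not node.children:
        return
    means = [sum(leaf_rewards(c)) / len(leaf_rewards(c)) for c in node.children]
    mu = sum(means) / len(means)
    sd = math.sqrt(sum((m - mu) ** 2 for m in means) / len(means)) or 1.0
    for c, m in zip(node.children, means):
        out[(id(node), c.token)] = (m - mu) / sd  # zero-mean across siblings
        sibling_advantages(c, out)

rng = random.Random(0)
root = Node(token=None)
expand(root, [], 0, max_depth=3, vocab=list(range(8)), branch=2, rng=rng)
adv = {}
sibling_advantages(root, adv)
print(len(adv))  # one advantage per (branching node, chosen child) pair
```

With branching factor 2 and depth 3, the tree has 7 internal nodes and 8 leaves, giving 14 sibling-relative advantages; each sibling group's advantages sum to zero, which is what lets the learning signal concentrate on decisive branch choices rather than being flattened by shared high-probability prefixes.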