

Spend Search Where It Pays: Value-Guided Structured Sampling and Optimization for Generative Recommendation

February 11, 2026
Authors: Jie Jiang, Yangru Huang, Zeyu Wang, Changping Wang, Yuling Xiong, Jun Zhang, Huan Yu
cs.AI

Abstract

Generative recommendation via autoregressive models unifies retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, Value-Guided Efficient Decoding (VED) identifies decisive nodes and selectively deepens high-potential prefixes, improving exploration efficiency without exhaustive tree search. Second, Sibling-GRPO exploits the induced tree topology to compute sibling-relative advantages, concentrating the learning signal on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
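To make the sibling-relative advantage idea concrete, here is a minimal sketch of how advantages might be normalized within sibling groups of a generation tree rather than across the whole rollout batch. This is an illustrative reading of the abstract, not the paper's implementation; the function name, the `parents` grouping convention, and the z-score normalization are all assumptions.

```python
# Hypothetical sketch in the spirit of Sibling-GRPO: trajectories are
# grouped by the branching node they descend from, and each advantage is
# computed relative to siblings only, so the learning signal concentrates
# on the branching decision instead of the shared high-probability prefix.
from collections import defaultdict

def sibling_advantages(parents, rewards, eps=1e-8):
    """parents[i]: id of the branching node trajectory i descends from.
    rewards[i]: scalar reward of trajectory i.
    Returns per-trajectory advantages normalized within sibling groups."""
    groups = defaultdict(list)
    for i, p in enumerate(parents):
        groups[p].append(i)

    adv = [0.0] * len(rewards)
    for idx in groups.values():
        # Mean and std computed over siblings only, not the whole batch.
        mean = sum(rewards[i] for i in idx) / len(idx)
        var = sum((rewards[i] - mean) ** 2 for i in idx) / len(idx)
        std = var ** 0.5
        for i in idx:
            adv[i] = (rewards[i] - mean) / (std + eps)
    return adv
```

Under this scheme, two trajectories that share a prefix but diverge at a decisive node are compared directly against each other, which counteracts the advantage-compression failure the abstract describes: a sibling group with near-identical rewards contributes near-zero advantage, while a group where one branch clearly wins yields a strong signal at exactly that branch.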