F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

Abstract

Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, where ranking operates over a predefined candidate set. Large Language Models (LLMs) broaden this into a generative process: given a candidate pool, an LLM can generate a subset and order it within a single autoregressive pass. However, this flexibility introduces a new optimization challenge: the model must search a combinatorial output space while receiving utility feedback only after the full ranked list is generated. Because this feedback is defined over the completed sequence, it cannot distinguish whether a poor result arises from failing to generate a relevant subset or from failing to rank that subset correctly. This credit assignment gap makes end-to-end optimization unstable and sample-inefficient. Existing systems often address this by separating candidate generation from ranking. However, such decoupling remains misaligned with downstream utility because ranking is limited by the candidate set it receives. To bridge this gap, we propose a unified framework that performs both within a single autoregressive rollout and optimizes them end-to-end via factorized group-relative policy optimization (F-GRPO). Our framework factorizes the policy into candidate generation and ranking while sharing a single LLM backbone, and jointly trains them with an order-invariant coverage reward and a position-aware utility reward. To address the resulting phase-specific credit assignment problem, we use separate group-relative advantages for generation and ranking within a two-phase sequence-level objective. Across sequential recommendation and multi-hop question answering benchmarks, F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers, with no architectural changes at inference time.
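The phase-specific credit assignment described above can be illustrated with a small sketch. The abstract does not give the exact reward or advantage formulas, so everything here is an assumption made for illustration: coverage is taken to be recall of the relevant items in the generated subset, utility is taken to be binary-relevance NDCG over the ranked list, and each advantage is the reward standardized within the rollout group, as in standard GRPO. The function names (`coverage_reward`, `utility_reward`, `group_relative`, `phase_advantages`) are hypothetical.

```python
import math
from statistics import mean, pstdev


def coverage_reward(subset, relevant):
    """Order-invariant reward (assumed form): recall of relevant items in the subset."""
    rel = set(relevant)
    if not rel:
        return 0.0
    return len(set(subset) & rel) / len(rel)


def utility_reward(ranking, relevant):
    """Position-aware reward (assumed form): binary-relevance NDCG of the ranked list."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(ranking) if item in rel)
    ideal_hits = min(len(rel), len(ranking))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / ideal if ideal > 0 else 0.0


def group_relative(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def phase_advantages(rollouts, relevant):
    """Compute one advantage per phase for each rollout in the group.

    Each rollout is a (generated_subset, ranked_list) pair sampled for the
    same query. The generation advantage depends only on coverage, and the
    ranking advantage only on position-aware utility.
    """
    gen_adv = group_relative([coverage_reward(s, relevant) for s, _ in rollouts])
    rank_adv = group_relative([utility_reward(r, relevant) for _, r in rollouts])
    return gen_adv, rank_adv


# Hypothetical group of three rollouts for one query with two relevant items.
relevant = ["a", "b"]
rollouts = [
    (["a", "b", "c"], ["a", "b", "c"]),  # good subset, good order
    (["a", "b", "c"], ["c", "b", "a"]),  # good subset, poor order
    (["c", "d", "e"], ["c", "d", "e"]),  # poor subset
]
gen_adv, rank_adv = phase_advantages(rollouts, relevant)
```

In this toy group the first two rollouts receive the same generation advantage because they selected the same subset, while their ranking advantages differ because only the order changed. A single sequence-level reward would blur these two signals together; how the two advantages are mapped to tokens and combined in the two-phase objective is not specified in the abstract and is left out of this sketch.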