ChatPaper.ai

Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs

February 2, 2026
作者: Yu Liang, Zhongjin Zhang, Yuxuan Zhu, Kerui Zhang, Zhiluohan Guo, Wenhang Zhou, Zonqi Yang, Kangle Wu, Yabo Ni, Anxiang Zeng, Cong Fu, Jianxin Wang, Jiazhi Xia
cs.AI

Abstract

Semantic ID (SID)-based recommendation is a promising paradigm for scaling sequential recommender systems, but existing methods largely follow a semantic-centric pipeline: item embeddings are learned from foundation models and discretized using generic quantization schemes. This design is misaligned with generative recommendation objectives: semantic embeddings are weakly coupled with collaborative prediction, and generic quantization is inefficient at reducing sequential uncertainty for autoregressive modeling. To address these, we propose ReSID, a recommendation-native, principled SID framework that rethinks representation learning and quantization from the perspective of information preservation and sequential predictability, without relying on LLMs. ReSID consists of two components: (i) Field-Aware Masked Auto-Encoding (FAMAE), which learns predictive-sufficient item representations from structured features, and (ii) Globally Aligned Orthogonal Quantization (GAOQ), which produces compact and predictable SID sequences by jointly reducing semantic ambiguity and prefix-conditional uncertainty. Theoretical analysis and extensive experiments across ten datasets show the effectiveness of ReSID. ReSID consistently outperforms strong sequential and SID-based generative baselines by an average of over 10%, while reducing tokenization cost by up to 122x. Code is available at https://github.com/FuCongResearchSquad/ReSID.
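The "generic quantization schemes" the abstract critiques typically map a continuous item embedding to a tuple of discrete codes (a Semantic ID) via residual quantization, one codebook per level. The following is a minimal sketch of that baseline pipeline, not of the paper's GAOQ; all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_quantize(emb, codebooks):
    """Greedily assign one code per level: at each level, pick the codeword
    nearest to the current residual, then subtract it. The resulting tuple
    of indices is the item's Semantic ID (SID)."""
    sid, residual = [], emb.copy()
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        sid.append(idx)
        residual = residual - codebook[idx]
    return tuple(sid)

# Illustrative setup: 3 quantization levels, 16 codes each, 8-dim embeddings.
dim, levels, codes = 8, 3, 16
codebooks = [rng.normal(size=(codes, dim)) for _ in range(levels)]
item_emb = rng.normal(size=dim)
sid = residual_quantize(item_emb, codebooks)
```

Because each level is fit independently of the downstream autoregressive recommender, nothing constrains earlier codes to reduce the uncertainty of later ones given a prefix, which is the misalignment the abstract attributes to such schemes.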
PDF · March 12, 2026