Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs
February 2, 2026
Authors: Yu Liang, Zhongjin Zhang, Yuxuan Zhu, Kerui Zhang, Zhiluohan Guo, Wenhang Zhou, Zonqi Yang, Kangle Wu, Yabo Ni, Anxiang Zeng, Cong Fu, Jianxin Wang, Jiazhi Xia
cs.AI
Abstract
Semantic ID (SID)-based recommendation is a promising paradigm for scaling sequential recommender systems, but existing methods largely follow a semantic-centric pipeline: item embeddings are learned from foundation models and discretized with generic quantization schemes. This design is misaligned with generative recommendation objectives: semantic embeddings are only weakly coupled with collaborative prediction, and generic quantization is inefficient at reducing sequential uncertainty for autoregressive modeling. To address these issues, we propose ReSID, a recommendation-native, principled SID framework that rethinks representation learning and quantization from the perspective of information preservation and sequential predictability, without relying on LLMs. ReSID consists of two components: (i) Field-Aware Masked Auto-Encoding (FAMAE), which learns predictively sufficient item representations from structured features, and (ii) Globally Aligned Orthogonal Quantization (GAOQ), which produces compact and predictable SID sequences by jointly reducing semantic ambiguity and prefix-conditional uncertainty. Theoretical analysis and extensive experiments on ten datasets demonstrate the effectiveness of ReSID: it consistently outperforms strong sequential and SID-based generative baselines by an average of over 10%, while reducing tokenization cost by up to 122x. Code is available at https://github.com/FuCongResearchSquad/ReSID.
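To make the two components concrete, the sketch below illustrates the general shape of this kind of pipeline. It is not the authors' implementation (see the linked repository for that): it assumes PyTorch, invented field names and dimensions, and substitutes plain residual quantization as a stand-in for GAOQ, whose globally aligned orthogonal objective is not reproduced here. FAMAE is rendered in its simplest reading: mask a random subset of an item's structured feature fields and reconstruct them from the remaining fields.

```python
import torch
import torch.nn as nn

class FieldAwareMaskedAutoEncoder(nn.Module):
    """Hypothetical sketch of FAMAE: mask a random subset of an item's
    structured feature fields and reconstruct them from the rest."""

    def __init__(self, num_fields: int, field_vocab: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(field_vocab, dim)           # per-field feature embeddings
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.field_pos = nn.Parameter(torch.zeros(1, num_fields, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decode = nn.Linear(dim, field_vocab)             # predict masked field ids

    def forward(self, fields: torch.Tensor, mask_ratio: float = 0.3):
        # fields: (batch, num_fields) integer ids of structured features
        x = self.embed(fields) + self.field_pos
        mask = torch.rand(fields.shape, device=fields.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        h = self.encoder(x)                                   # contextualized field states
        logits = self.decode(h)                               # (batch, num_fields, vocab)
        loss = nn.functional.cross_entropy(
            logits[mask], fields[mask])                       # loss only on masked fields
        item_repr = h.mean(dim=1)                             # pooled item representation
        return loss, item_repr

def quantize_to_sid(item_repr: torch.Tensor, codebooks: list[torch.Tensor]):
    """Plain residual quantization as a stand-in for GAOQ: each level picks
    the nearest codeword and passes the residual to the next level."""
    residual, sid = item_repr, []
    for cb in codebooks:                                      # cb: (codebook_size, dim)
        idx = torch.cdist(residual, cb).argmin(dim=-1)        # nearest codeword per item
        sid.append(idx)
        residual = residual - cb[idx]
    return torch.stack(sid, dim=-1)                           # (batch, num_levels) SIDs
```

Under this reading, training the auto-encoder pushes the pooled representation toward being predictively sufficient for the item's structured fields, and the stacked per-level indices form the item's SID sequence consumed by the autoregressive recommender; GAOQ would further shape the codebooks so that later tokens are predictable from earlier ones.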