Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs
February 2, 2026
Authors: Yu Liang, Zhongjin Zhang, Yuxuan Zhu, Kerui Zhang, Zhiluohan Guo, Wenhang Zhou, Zonqi Yang, Kangle Wu, Yabo Ni, Anxiang Zeng, Cong Fu, Jianxin Wang, Jiazhi Xia
cs.AI
Abstract
Semantic ID (SID)-based recommendation is a promising paradigm for scaling sequential recommender systems, but existing methods largely follow a semantic-centric pipeline: item embeddings are learned from foundation models and discretized with generic quantization schemes. This design is misaligned with generative recommendation objectives: semantic embeddings are only weakly coupled with collaborative prediction, and generic quantization is inefficient at reducing sequential uncertainty for autoregressive modeling. To address these issues, we propose ReSID, a recommendation-native, principled SID framework that rethinks representation learning and quantization from the perspective of information preservation and sequential predictability, without relying on LLMs. ReSID consists of two components: (i) Field-Aware Masked Auto-Encoding (FAMAE), which learns predictively sufficient item representations from structured features, and (ii) Globally Aligned Orthogonal Quantization (GAOQ), which produces compact and predictable SID sequences by jointly reducing semantic ambiguity and prefix-conditional uncertainty. Theoretical analysis and extensive experiments on ten datasets demonstrate the effectiveness of ReSID: it consistently outperforms strong sequential and SID-based generative baselines by an average of over 10%, while reducing tokenization cost by up to 122x. Code is available at https://github.com/FuCongResearchSquad/ReSID.
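To make the two components concrete, the sketch below illustrates the general shape of this kind of pipeline. It is not the authors' implementation (see the linked repository for that): it assumes PyTorch, invented field names and dimensions, and substitutes plain residual quantization as a stand-in for GAOQ, whose globally aligned orthogonal objective is not reproduced here. FAMAE is rendered in its simplest reading: mask a random subset of an item's structured feature fields and reconstruct them from the remaining fields.

```python
import torch
import torch.nn as nn

class FieldAwareMaskedAutoEncoder(nn.Module):
    """Hypothetical sketch of FAMAE: mask a random subset of an item's
    structured feature fields and reconstruct them from the rest."""

    def __init__(self, num_fields: int, field_vocab: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(field_vocab, dim)           # per-field feature embeddings
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.field_pos = nn.Parameter(torch.zeros(1, num_fields, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decode = nn.Linear(dim, field_vocab)             # predict masked field ids

    def forward(self, fields: torch.Tensor, mask_ratio: float = 0.3):
        # fields: (batch, num_fields) integer ids of structured features
        x = self.embed(fields) + self.field_pos
        mask = torch.rand(fields.shape, device=fields.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        h = self.encoder(x)                                   # contextualized field states
        logits = self.decode(h)                               # (batch, num_fields, vocab)
        loss = nn.functional.cross_entropy(
            logits[mask], fields[mask])                       # loss only on masked fields
        item_repr = h.mean(dim=1)                             # pooled item representation
        return loss, item_repr

def quantize_to_sid(item_repr: torch.Tensor, codebooks: list[torch.Tensor]):
    """Plain residual quantization as a stand-in for GAOQ: each level picks
    the nearest codeword and passes the residual to the next level."""
    residual, sid = item_repr, []
    for cb in codebooks:                                      # cb: (codebook_size, dim)
        idx = torch.cdist(residual, cb).argmin(dim=-1)        # nearest codeword per item
        sid.append(idx)
        residual = residual - cb[idx]
    return torch.stack(sid, dim=-1)                           # (batch, num_levels) SIDs
```

Under this reading, training the auto-encoder pushes the pooled representation toward being predictively sufficient for the item's structured fields, and the stacked per-level indices form the item's SID sequence consumed by the autoregressive recommender; GAOQ would further shape the codebooks so that later tokens are predictable from earlier ones.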