ChatPaper.ai

Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs

February 2, 2026
作者: Yu Liang, Zhongjin Zhang, Yuxuan Zhu, Kerui Zhang, Zhiluohan Guo, Wenhang Zhou, Zonqi Yang, Kangle Wu, Yabo Ni, Anxiang Zeng, Cong Fu, Jianxin Wang, Jiazhi Xia
cs.AI

Abstract

Semantic ID (SID)-based recommendation is a promising paradigm for scaling sequential recommender systems, but existing methods largely follow a semantic-centric pipeline: item embeddings are learned from foundation models and discretized using generic quantization schemes. This design is misaligned with generative recommendation objectives: semantic embeddings are weakly coupled with collaborative prediction, and generic quantization is inefficient at reducing sequential uncertainty for autoregressive modeling. To address these, we propose ReSID, a recommendation-native, principled SID framework that rethinks representation learning and quantization from the perspective of information preservation and sequential predictability, without relying on LLMs. ReSID consists of two components: (i) Field-Aware Masked Auto-Encoding (FAMAE), which learns predictive-sufficient item representations from structured features, and (ii) Globally Aligned Orthogonal Quantization (GAOQ), which produces compact and predictable SID sequences by jointly reducing semantic ambiguity and prefix-conditional uncertainty. Theoretical analysis and extensive experiments across ten datasets show the effectiveness of ReSID. ReSID consistently outperforms strong sequential and SID-based generative baselines by an average of over 10%, while reducing tokenization cost by up to 122x. Code is available at https://github.com/FuCongResearchSquad/ReSID.
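The "generic quantization schemes" the abstract critiques typically map a continuous item embedding to a tuple of discrete codes (a Semantic ID) via residual quantization, one codebook per level. The following is a minimal sketch of that baseline pipeline, not of the paper's GAOQ; all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_quantize(emb, codebooks):
    """Greedily assign one code per level: at each level, pick the codeword
    nearest to the current residual, then subtract it. The resulting tuple
    of indices is the item's Semantic ID (SID)."""
    sid, residual = [], emb.copy()
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        sid.append(idx)
        residual = residual - codebook[idx]
    return tuple(sid)

# Illustrative setup: 3 quantization levels, 16 codes each, 8-dim embeddings.
dim, levels, codes = 8, 3, 16
codebooks = [rng.normal(size=(codes, dim)) for _ in range(levels)]
item_emb = rng.normal(size=dim)
sid = residual_quantize(item_emb, codebooks)
```

Because each level is fit independently of the downstream autoregressive recommender, nothing constrains earlier codes to reduce the uncertainty of later ones given a prefix, which is the misalignment the abstract attributes to such schemes.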
PDF · March 12, 2026