Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation
February 7, 2026
Authors: Benyu Zhang, Qiang Zhang, Jianpeng Cheng, Hong-You Chen, Qifei Wang, Wei Sun, Shen Li, Jia Li, Jiahao Wu, Xiangjun Fan, Hong Yan
cs.AI
Abstract
Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this may be attributed to the inherent noise, bias, and incompleteness of raw user interaction data in prior continual pre-training (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents such issues by creating a curated, pedagogical curriculum for the LLM. We provide powerful, direct evidence for the utility of our curriculum by showing that standard sequential models trained on our principled synthetic data significantly outperform models trained on real data in downstream ranking tasks (+130% on recall@100 for SASRec), demonstrating its superiority for learning generalizable user preference patterns. Building on this, we empirically demonstrate, for the first time, robust power-law scaling for an LLM that is continually pre-trained on our high-quality, recommendation-specific data. Our experiments reveal consistent and predictable perplexity reduction across multiple synthetic data modalities. These findings establish a foundational methodology for reliably scaling LLM capabilities in the recommendation domain, thereby shifting the research focus from mitigating data deficiencies to leveraging high-quality, structured information.
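For readers unfamiliar with the reported metric, recall@k measures the fraction of a user's held-out interactions that appear in a model's top-k ranked list. A minimal illustrative sketch (toy item IDs and ranking, not data from the paper):

```python
def recall_at_k(ranked_items, held_out_items, k=100):
    """Fraction of a user's held-out items that appear in the top-k ranking."""
    top_k = set(ranked_items[:k])
    hits = sum(1 for item in held_out_items if item in top_k)
    return hits / len(held_out_items)

# Toy example: the ranking stands in for a sequential model's output.
ranking = list(range(200))   # hypothetical model ranks item IDs 0..199
held_out = [3, 42, 999]      # hypothetical ground-truth future interactions
score = recall_at_k(ranking, held_out, k=100)  # 2 of 3 hits -> 0.666...
```

The "+130%" result above is a relative improvement in this metric between the two training regimes, averaged over users.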
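The power-law scaling claim means perplexity should follow roughly ppl ≈ a · n^(−α) as the pre-training data size n grows, so the exponent α can be estimated by a linear fit in log-log space. A minimal sketch with fabricated measurements (the token counts and perplexities below are illustrative, not the paper's results):

```python
import math

# Hypothetical (training_tokens, perplexity) pairs lying near a power law.
observations = [
    (1e8, 12.0),
    (1e9, 9.0),
    (1e10, 6.8),
    (1e11, 5.1),
]

def fit_power_law(points):
    """Fit ppl ~ a * n**(-alpha) by ordinary least squares in log-log space."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(p) for _, p in points]
    k = len(points)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, alpha)

a, alpha = fit_power_law(observations)
```

A fitted α > 0 with small residuals is what "consistent and predictable perplexity reduction" looks like quantitatively; the fit then extrapolates expected perplexity at larger data budgets.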