Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation
February 7, 2026
Authors: Benyu Zhang, Qiang Zhang, Jianpeng Cheng, Hong-You Chen, Qifei Wang, Wei Sun, Shen Li, Jia Li, Jiahao Wu, Xiangjun Fan, Hong Yan
cs.AI
Abstract
Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this may be attributed to the inherent noise, bias, and incompleteness of the raw user interaction data used in prior continual pre-training (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents these issues by creating a curated, pedagogical curriculum for the LLM. We provide powerful, direct evidence for the utility of our curriculum by showing that standard sequential models trained on our principled synthetic data significantly outperform models trained on real data in downstream ranking tasks (+130% recall@100 for SasRec), demonstrating its superiority for learning generalizable user preference patterns. Building on this, we empirically demonstrate, for the first time, robust power-law scaling for an LLM that is continually pre-trained on our high-quality, recommendation-specific data. Our experiments reveal consistent and predictable perplexity reduction across multiple synthetic data modalities. These findings establish a foundational methodology for reliably scaling LLM capabilities in the recommendation domain, thereby shifting the research focus from mitigating data deficiencies to leveraging high-quality, structured information.
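To make the reported ranking metric concrete, here is a minimal sketch of recall@K as it is conventionally computed for sequential recommenders such as SasRec. The function and variable names (`ranked_items`, `ground_truth`) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def recall_at_k(ranked_items: np.ndarray, ground_truth: np.ndarray, k: int = 100) -> float:
    """Fraction of users whose held-out item appears in their top-k ranked list.

    ranked_items: (num_users, num_candidates) item ids sorted by predicted score.
    ground_truth: (num_users,) the single held-out next item per user.
    """
    top_k = ranked_items[:, :k]                          # top-k predictions per user
    hits = (top_k == ground_truth[:, None]).any(axis=1)  # did the true item appear?
    return float(hits.mean())

# A +130% relative improvement means, e.g., recall@100 rising from 0.10 to 0.23
# (hypothetical numbers; the paper reports only the relative gain here).
```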
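The "power-law scaling" claim refers to fits of the standard form L(d) = a·d^(−b) + c relating loss (and hence perplexity, exp(loss)) to a scaled resource such as data, parameters, or compute. The sketch below fits that form to entirely hypothetical numbers, purely to illustrate how such a law makes perplexity reduction predictable; it does not reproduce the paper's measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(d, a, b, c):
    # L(d) = a * d^(-b) + c: loss falls predictably as the resource d grows.
    return a * np.power(d, -b) + c

# Hypothetical (tokens_seen, validation_loss) pairs; not the paper's data.
d = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
loss = np.array([3.10, 2.85, 2.62, 2.47, 2.35])

(a, b, c), _ = curve_fit(power_law, d, loss, p0=[10.0, 0.1, 2.0], maxfev=10000)
print(f"fit: L(d) = {a:.2f} * d^(-{b:.3f}) + {c:.2f}")
print(f"extrapolated loss at 1e11 tokens: {power_law(1e11, a, b, c):.2f}")
# Since perplexity = exp(loss), a power law in loss implies a consistent,
# predictable perplexity reduction as training scales up.
```

A clean fit of this kind is what lets practitioners budget compute in advance: once a, b, and c are estimated from small runs, the loss of a much larger run can be forecast before it is trained.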