Wukong: Towards a Scaling Law for Large-Scale Recommendation
March 4, 2024
Authors: Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Daifeng Guo, Yanli Zhao, Shen Li, Yuchen Hao, Yantao Yao, Guna Lakshminarayanan, Ellie Dingqiao Wen, Jongsoo Park, Maxim Naumov, Wenlin Chen
cs.AI
Abstract
Scaling laws play an instrumental role in the sustainable improvement of model quality. Unfortunately, recommendation models to date do not exhibit scaling laws similar to those observed in the domain of large language models, due to the inefficiencies of their upscaling mechanisms. This limitation poses significant challenges in adapting these models to increasingly complex real-world datasets. In this paper, we propose an effective network architecture based purely on stacked factorization machines, and a synergistic upscaling strategy, collectively dubbed Wukong, to establish a scaling law in the domain of recommendation. Wukong's unique design makes it possible to capture diverse, any-order interactions simply through taller and wider layers. We conducted extensive evaluations on six public datasets, and our results demonstrate that Wukong consistently outperforms state-of-the-art models in quality. Further, we assessed Wukong's scalability on an internal, large-scale dataset. The results show that Wukong retains its superiority in quality over state-of-the-art models while holding the scaling law across two orders of magnitude in model complexity, extending beyond 100 GFLOP, or equivalently up to the GPT-3/LLaMa-2 scale of total training compute, where prior art falls short.
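To make the "stacked factorization machines" idea concrete, here is a minimal, hypothetical PyTorch sketch of one FM-style interaction block that can be stacked. It is our own simplification for illustration, not the paper's implementation: the names `FMBlock`, `num_emb_in`, `num_emb_out`, and the linear projection that re-embeds the pairwise-interaction matrix are all assumptions.

```python
# Hypothetical sketch of a stackable factorization-machine-style block.
# Not the Wukong implementation; names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class FMBlock(nn.Module):
    """Takes n input embeddings of dimension d, forms all pairwise
    dot-product interactions, and projects them back into a set of
    output embeddings so that blocks can be stacked."""

    def __init__(self, num_emb_in: int, num_emb_out: int, dim: int):
        super().__init__()
        # Mix the flattened (n x n) interaction matrix into the next
        # layer's embeddings. "Wider" = larger num_emb_out / dim;
        # "taller" = more stacked blocks.
        self.proj = nn.Linear(num_emb_in * num_emb_in, num_emb_out * dim)
        self.num_emb_out = num_emb_out
        self.dim = dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_emb_in, dim)
        inter = torch.bmm(x, x.transpose(1, 2))           # pairwise dot products: (batch, n, n)
        out = self.proj(inter.flatten(start_dim=1))       # mix interactions
        return out.view(-1, self.num_emb_out, self.dim)   # re-embed for the next block


# Stacking blocks composes interactions of interactions.
model = nn.Sequential(FMBlock(16, 16, 32), FMBlock(16, 16, 32))
x = torch.randn(8, 16, 32)  # batch of 8 examples, 16 feature embeddings of dim 32
print(model(x).shape)       # torch.Size([8, 16, 32])
```

Because each block forms dot products between its inputs, stacking L such blocks composes interactions of interactions, so the effective interaction order grows with depth (up to 2^L in this sketch), while widening corresponds to increasing `num_emb_out` and `dim`. This is the sense in which taller and wider layers can capture diverse, higher-order interactions.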