Data Science and Technology Towards AGI Part I: Tiered Data Management

February 9, 2026
Authors: Yudong Wang, Zixuan Fu, Hengyu Zhao, Chen Zhao, Chuyue Zhou, Xinle Lin, Hongya Lyu, Shuaikang Xue, Yi Yi, Yingjiao Wang, Zhi Zheng, Yuzhou Zhang, Jie Zhou, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size and increasingly encounters bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Specifically, the framework organizes data into five tiers, L0-L4, ranging from raw uncurated resources to organized and verifiable knowledge. Importantly, LLMs are used throughout the data management process, for tasks such as quality scoring and content editing, to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, enabling data to be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the effectiveness of the proposed framework through empirical studies, in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.
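The abstract does not spell out how tier assignment or stage-wise allocation is implemented. As a minimal illustrative sketch only, the following Python snippet shows one way tier-aware data routing could look; all names, thresholds, and stage-to-tier mappings here are hypothetical assumptions for illustration, not the authors' actual pipeline.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable

class Tier(IntEnum):
    """Hypothetical L0-L4 tiers, from raw uncurated data to organized, verifiable knowledge."""
    L0 = 0  # raw, uncurated resources
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4  # organized and verifiable knowledge

@dataclass
class Document:
    text: str
    quality: float = 0.0   # e.g. an LLM-judge quality score in [0, 1] (assumed scale)
    tier: Tier = Tier.L0

# Assumed score thresholds, checked from highest tier down.
TIER_THRESHOLDS = [(0.90, Tier.L4), (0.75, Tier.L3), (0.50, Tier.L2), (0.25, Tier.L1)]

def assign_tier(doc: Document, score_fn: Callable[[str], float]) -> Document:
    """Score a document (e.g. with a prompted LLM judge) and map the score to a tier."""
    doc.quality = score_fn(doc.text)
    doc.tier = next((t for cut, t in TIER_THRESHOLDS if doc.quality >= cut), Tier.L0)
    return doc

# Assumed allocation of tiers to training stages: broad coverage early,
# progressively higher-quality subsets for mid-training and alignment.
STAGE_TIERS = {
    "pre-training": {Tier.L1, Tier.L2, Tier.L3, Tier.L4},
    "mid-training": {Tier.L3, Tier.L4},
    "alignment":    {Tier.L4},
}

def select_for_stage(docs: list[Document], stage: str) -> list[Document]:
    """Keep only documents whose tier is admitted for the given training stage."""
    return [d for d in docs if d.tier in STAGE_TIERS[stage]]
```

In this sketch, `score_fn` stands in for whatever quality-scoring model or heuristic is used; swapping thresholds or the stage-to-tier map is the knob that trades acquisition cost against marginal training benefit.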