HARE: HumAn pRiors, a key to small language model Efficiency
June 17, 2024
Authors: Lingyun Zhang, Bin Jin, Gaojian Ge, Lunhui Liu, Xuewen Shen, Mingyong Wu, Houqian Zhang, Yongneng Jiang, Shiqi Chen, Shi Pu
cs.AI
Abstract
Human priors play a crucial role in efficiently utilizing data in deep
learning. However, with the development of large language models (LLMs), there
is an increasing emphasis on scaling both model size and data volume, which
often diminishes the importance of human priors in data construction.
Influenced by these trends, existing Small Language Models (SLMs) mainly rely
on web-scraped large-scale training data, neglecting the proper incorporation
of human priors. This oversight limits the training efficiency of language
models in resource-constrained settings. In this paper, we propose a principle
to leverage human priors for data construction. This principle emphasizes
achieving high-performance SLMs by training on a concise dataset that
accommodates both semantic diversity and data quality consistency, while
avoiding benchmark data leakage. Following this principle, we train an SLM
named HARE-1.1B. Extensive experiments on large-scale benchmark datasets
demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs,
validating the effectiveness of the proposed principle. Additionally, this
provides new insights into efficient language model training in
resource-constrained environments from the view of human priors.