HARE: HumAn pRiors, a key to small language model Efficiency

Abstract

Human priors play a crucial role in using data efficiently in deep learning. However, with the development of large language models (LLMs), emphasis has shifted toward scaling both model size and data volume, which often diminishes the role of human priors in data construction. Influenced by this trend, existing small language models (SLMs) rely mainly on large-scale web-scraped training data and neglect the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle for leveraging human priors in data construction. The principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency while avoiding benchmark data leakage. Following this principle, we train an SLM named HARE-1.1B. Extensive experiments on large-scale benchmark datasets demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs, validating the effectiveness of the proposed principle. This work also provides new insights into efficient language model training in resource-constrained environments from the perspective of human priors.
