Adam's Law: Textual Frequency Law on Large Language Models

Abstract

Textual frequency has been shown to be relevant to human cognition, as reflected in reading speed, but its relationship to Large Language Models (LLMs) has seldom been studied. We propose a new research direction centered on textual data frequency, a topic that is, to the best of our knowledge, understudied. Our framework has three components. First, we propose the Textual Frequency Law (TFL), which states that frequent textual data should be preferred for LLMs in both prompting and fine-tuning. Because the training data of many LLMs is not public, we estimate sentence-level frequency from online resources, and we use an input paraphraser to rewrite each input into a more frequent textual expression. Second, we propose Textual Frequency Distillation (TFD), which queries LLMs to perform story completion by extending the sentences in the datasets; the resulting corpora are used to adjust the initial frequency estimates. Third, we propose Curriculum Textual Frequency Training (CTFT), which fine-tunes LLMs in increasing order of sentence-level frequency. We conduct experiments on our curated Textual Frequency Paired Dataset (TFPD), covering math reasoning, machine translation, commonsense reasoning, and agentic tool calling. The results show the effectiveness of our framework.
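To make the two frequency-based operations in the abstract concrete (preferring the more frequent paraphrase, and ordering fine-tuning data by increasing sentence-level frequency), the following is a minimal sketch only. The abstract does not specify how sentence-level frequency is computed, so the geometric mean of per-word relative frequencies used here is an assumed proxy, and `word_freq`, `sentence_frequency`, `prefer_frequent`, and `curriculum_order` are hypothetical names introduced for illustration, not the paper's estimator or API.

```python
import math
from typing import Dict, List


def sentence_frequency(sentence: str, word_freq: Dict[str, float],
                       floor: float = 1e-9) -> float:
    """Assumed proxy: geometric mean of per-word relative frequencies.

    `word_freq` is a hypothetical lookup table (word -> relative frequency),
    standing in for whatever online resource is actually used.
    Unknown words fall back to `floor` so the logarithm stays defined.
    """
    words = sentence.lower().split()
    if not words:
        return 0.0
    log_sum = sum(math.log(max(word_freq.get(w, 0.0), floor)) for w in words)
    return math.exp(log_sum / len(words))


def prefer_frequent(candidates: List[str], word_freq: Dict[str, float]) -> str:
    """Pick the paraphrase with the highest estimated frequency."""
    return max(candidates, key=lambda s: sentence_frequency(s, word_freq))


def curriculum_order(sentences: List[str],
                     word_freq: Dict[str, float]) -> List[str]:
    """Sort training sentences in increasing order of estimated frequency."""
    return sorted(sentences, key=lambda s: sentence_frequency(s, word_freq))


if __name__ == "__main__":
    # Toy frequencies, invented purely for this example.
    toy_freq = {"the": 0.05, "cat": 1e-3, "sat": 5e-4,
                "feline": 1e-5, "rested": 2e-5}
    pair = ["the feline rested", "the cat sat"]
    print(prefer_frequent(pair, toy_freq))
    print(curriculum_order(pair, toy_freq))
```

Under this proxy, a paraphrase built from common words scores higher than one built from rare words, so it would be chosen for prompting and would appear later in the curriculum. Any other sentence-level estimator could be swapped in for `sentence_frequency` without changing the two selection functions.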
