Adam's Law: Textual Frequency Law on Large Language Models

April 2, 2026
Authors: Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam
cs.AI

Abstract

While textual frequency has been shown to correlate with human reading speed, its relationship to Large Language Models (LLMs) has seldom been studied. To the best of our knowledge, this paper is the first to propose a research direction centered on textual data frequency, an understudied topic. Our framework consists of three components. First, we propose the Textual Frequency Law (TFL), which states that high-frequency textual data should be preferred for LLMs in both prompting and fine-tuning. Since the training data of most LLMs is not publicly released, we estimate sentence-level frequency from online resources and use an input paraphraser to rewrite inputs into more frequent textual expressions. Second, we propose Textual Frequency Distillation (TFD), which queries LLMs to perform story completion by extending the sentences in the datasets; the resulting corpora are used to refine the initial frequency estimates. Finally, we propose Curriculum Textual Frequency Training (CTFT), which fine-tunes LLMs on data ordered from low to high sentence-level frequency. Experiments on our curated Textual Frequency Paired Dataset (TFPD), covering math reasoning, machine translation, commonsense reasoning, and agentic tool calling, demonstrate the effectiveness of our framework.
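
To make the TFL and CTFT steps concrete, below is a minimal Python sketch of how sentence-level frequency estimation, paraphrase selection, and curriculum ordering might be wired together. The abstract does not specify the paper's estimator, so the sketch substitutes the open-source `wordfreq` library's Zipf scores as a frequency proxy; all function names here are hypothetical and assumed, not taken from the paper.

```python
# Hypothetical sketch: a sentence-frequency proxy (TFL) and curriculum
# ordering (CTFT). Uses the `wordfreq` library's Zipf scores in place
# of the paper's unspecified online-resource estimator.
from wordfreq import tokenize, zipf_frequency


def estimate_sentence_frequency(sentence: str, lang: str = "en") -> float:
    """Proxy for sentence-level frequency: mean Zipf score of the tokens."""
    tokens = tokenize(sentence, lang)
    if not tokens:
        return 0.0
    return sum(zipf_frequency(t, lang) for t in tokens) / len(tokens)


def pick_more_frequent(original: str, paraphrases: list[str]) -> str:
    """TFL-style selection: among an input and its paraphrases (e.g. from
    an LLM paraphraser), keep the expression with the highest estimated
    frequency."""
    return max([original, *paraphrases], key=estimate_sentence_frequency)


def ctft_order(examples: list[str]) -> list[str]:
    """CTFT-style curriculum: order fine-tuning data from low to high
    estimated sentence-level frequency."""
    return sorted(examples, key=estimate_sentence_frequency)


if __name__ == "__main__":
    data = [
        "Heteroscedasticity confounds naive estimators.",
        "The cat sat on the mat.",
        "I went to the store yesterday.",
    ]
    for s in ctft_order(data):
        print(f"{estimate_sentence_frequency(s):.2f}  {s}")
```

The mean-Zipf proxy and the max-over-candidates selection rule are our assumptions for illustration only; the paper additionally refines its frequency estimates with TFD story-completion corpora, which this sketch omits.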