TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
May 12, 2023
Authors: Ronen Eldan, Yuanzhi Li
cs.AI
Abstract
Language models (LMs) are powerful tools for natural language processing, but
they often struggle to produce coherent and fluent text when they are small.
Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can
rarely generate coherent and consistent English text beyond a few words even
after extensive training. This raises the question of whether the emergence of
the ability to produce coherent English text only occurs at larger scales (with
hundreds of millions of parameters or more) and complex architectures (with
many layers of global attention).
In this work, we introduce TinyStories, a synthetic dataset of short stories
that only contain words that typical 3- to 4-year-olds usually understand,
generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train
and evaluate LMs that are much smaller than the state-of-the-art models (below
10 million total parameters), or have much simpler architectures (with only one
transformer block), yet still produce fluent and consistent stories with
several paragraphs that are diverse and have almost perfect grammar, and
demonstrate reasoning capabilities.
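
To give a sense of the scales mentioned above, the following is a rough sketch of how the parameter count of a GPT-2-style decoder can be estimated. The formula assumes learned position embeddings, a 4x MLP expansion, biases on all linear layers, and tied input/output embeddings; the "tiny" configuration is a hypothetical illustration, not a configuration taken from this work.

```python
def transformer_param_count(vocab_size: int, d_model: int, n_layers: int,
                            context_len: int, tie_embeddings: bool = True) -> int:
    """Estimate the parameter count of a GPT-2-style decoder-only transformer."""
    # Token embeddings plus learned position embeddings.
    embed = vocab_size * d_model + context_len * d_model
    # Attention: Q, K, V and output projections (weights + biases).
    attn = 4 * d_model * d_model + 4 * d_model
    # MLP with 4x expansion: two weight matrices plus their biases.
    mlp = 2 * d_model * (4 * d_model) + 4 * d_model + d_model
    # Two layer norms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    block = attn + mlp + norms
    final_norm = 2 * d_model
    # With tied embeddings the output head reuses the token embedding matrix.
    head = 0 if tie_embeddings else vocab_size * d_model
    return embed + n_layers * block + final_norm + head


# GPT-2 (small): vocabulary 50257, width 768, 12 layers, context 1024.
gpt2_small = transformer_param_count(50257, 768, 12, 1024)
print(gpt2_small)  # 124439808, i.e. roughly 125M

# Hypothetical one-block model with a reduced vocabulary (assumed values).
tiny = transformer_param_count(10000, 64, 1, 512)
print(tiny)  # 722880, well below 10 million
```

Under these assumptions, most of the parameters of a very small model sit in the embedding table, so the vocabulary size matters at least as much as depth or width.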
We also introduce a new paradigm for the evaluation of language models: We
suggest a framework which uses GPT-4 to grade the content generated by these
models as if those were stories written by students and graded by a (human)
teacher. This new paradigm overcomes the flaws of standard benchmarks, which
often require the model's output to be highly structured, and moreover provides
a multidimensional score for the model, with separate scores for different
capabilities such as grammar, creativity and consistency.
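
The grading idea can be sketched as follows. This is a minimal illustration only: the prompt wording, the 1 to 10 scale, the reply format, and all function names are assumptions made for the example, and the call to the grading model itself is deliberately left out.

```python
import re
from statistics import mean

# Dimensions named in the abstract; the 1-10 scale below is an assumption.
DIMENSIONS = ("grammar", "creativity", "consistency")


def build_grading_prompt(story_beginning: str, completion: str) -> str:
    """Build a hypothetical teacher-style grading prompt for one completion."""
    rubric = "\n".join(f"{name}: <score>/10" for name in DIMENSIONS)
    return (
        "A student was given the beginning of a story and asked to complete it.\n"
        f"Beginning: {story_beginning}\n"
        f"Student's completion: {completion}\n"
        "Grade the completion as a teacher would, replying in this format:\n"
        f"{rubric}"
    )


def parse_scores(reply: str) -> dict[str, int]:
    """Extract one integer score per dimension from the grader's reply."""
    scores = {}
    for name in DIMENSIONS:
        match = re.search(rf"{name}\s*:\s*(\d+)\s*/\s*10", reply, flags=re.IGNORECASE)
        if match is None:
            raise ValueError(f"no score found for {name!r}")
        scores[name] = int(match.group(1))
    return scores


def average_scores(all_scores: list[dict[str, int]]) -> dict[str, float]:
    """Average each dimension over many graded completions."""
    return {name: mean(s[name] for s in all_scores) for name in DIMENSIONS}


# Canned replies stand in for the grading model's output.
replies = [
    "Grammar: 9/10\nCreativity: 6/10\nConsistency: 8/10",
    "grammar: 7/10, creativity: 4/10, consistency: 6/10",
]
report = average_scores([parse_scores(r) for r in replies])
print(report)
```

Averaging per dimension over many completions is what yields a multidimensional score for a model rather than a single number.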
We hope that TinyStories can facilitate the development, analysis and
research of LMs, especially for low-resource or specialized domains, and shed
light on the emergence of language capabilities in LMs.