TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
May 12, 2023
Authors: Ronen Eldan, Yuanzhi Li
cs.AI
Abstract
Language models (LMs) are powerful tools for natural language processing, but
they often struggle to produce coherent and fluent text when they are small.
Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can
rarely generate coherent and consistent English text beyond a few words even
after extensive training. This raises the question of whether the emergence of
the ability to produce coherent English text only occurs at larger scales (with
hundreds of millions of parameters or more) and complex architectures (with
many layers of global attention).
In this work, we introduce TinyStories, a synthetic dataset of short stories
that only contain words that a typical 3 to 4-year-old usually understands,
generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train
and evaluate LMs that are much smaller than the state-of-the-art models (below
10 million total parameters), or have much simpler architectures (with only one
transformer block), yet still produce fluent and consistent stories with
several paragraphs that are diverse and have almost perfect grammar, and
demonstrate reasoning capabilities.
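To see why a one-block transformer can plausibly stay under 10 million total parameters, a back-of-the-envelope count helps. The dimensions below (vocabulary size, hidden width, feed-forward width) are illustrative assumptions, not the paper's exact configuration; a small vocabulary of simple words keeps the embedding table cheap.

```python
# Rough parameter count for a one-block transformer language model.
# All dimensions are illustrative assumptions, not the paper's config.

def transformer_params(vocab_size: int, d_model: int, d_ff: int, n_blocks: int) -> int:
    embed = vocab_size * d_model           # token embedding (output head often tied)
    per_block = (
        4 * d_model * d_model              # Q, K, V, and attention output projections
        + 2 * d_model * d_ff               # feed-forward up- and down-projections
    )
    return embed + n_blocks * per_block

# With a limited vocabulary of simple words, one block fits easily under 10M.
total = transformer_params(vocab_size=10_000, d_model=256, d_ff=1024, n_blocks=1)
print(total)  # a few million parameters, well below 10 million
```

The dominant cost at this scale is the embedding table itself, which is why restricting the vocabulary to words a young child understands directly shrinks the model.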
We also introduce a new paradigm for the evaluation of language models: We
suggest a framework which uses GPT-4 to grade the content generated by these
models as if it were stories written by students and graded by a (human)
teacher. This new paradigm overcomes the flaws of standard benchmarks, which
often require the model's output to be very structured; moreover, it provides
a multidimensional score for the model, with separate scores for capabilities
such as grammar, creativity, and consistency.
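One way to picture this teacher-style evaluation is as a rubric prompt handed to GPT-4 along with the model's story. The rubric wording and score axes below are assumptions for illustration, not the paper's actual prompt, and the call to GPT-4 itself is omitted.

```python
# Illustrative sketch of a teacher-style grading prompt; the rubric text
# is an assumption, not the paper's exact prompt. The GPT-4 API call that
# would consume this prompt is intentionally left out.

def build_grading_prompt(story: str) -> str:
    rubric = (
        "You are a teacher grading a short story written by a student.\n"
        "Give a score from 1 to 10 for each of: grammar, creativity, "
        "and consistency, then briefly justify each score."
    )
    return f"{rubric}\n\nStory:\n{story}"

prompt = build_grading_prompt("Once upon a time, a little dog found a red ball.")
```

Because the grader returns a score per axis rather than a single pass/fail judgment on structured output, the same prompt works for free-form generations of any length.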
We hope that TinyStories can facilitate the development, analysis and
research of LMs, especially for low-resource or specialized domains, and shed
light on the emergence of language capabilities in LMs.