TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
May 12, 2023
Authors: Ronen Eldan, Yuanzhi Li
cs.AI
Abstract
Language models (LMs) are powerful tools for natural language processing, but
they often struggle to produce coherent and fluent text when they are small.
Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can
rarely generate coherent and consistent English text beyond a few words even
after extensive training. This raises the question of whether the emergence of
the ability to produce coherent English text only occurs at larger scales (with
hundreds of millions of parameters or more) and complex architectures (with
many layers of global attention).
In this work, we introduce TinyStories, a synthetic dataset of short stories
that only contain words that a typical 3 to 4-year-old usually understands,
generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train
and evaluate LMs that are much smaller than the state-of-the-art models (below
10 million total parameters), or have much simpler architectures (with only one
transformer block), yet still produce fluent and consistent stories with
several paragraphs that are diverse and have almost perfect grammar, and
demonstrate reasoning capabilities.
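To see why a one-block transformer can plausibly stay under 10 million total parameters, a back-of-the-envelope count helps. The dimensions below (vocabulary size, hidden width, feed-forward width) are illustrative assumptions, not the paper's exact configuration; a small vocabulary of simple words keeps the embedding table cheap.

```python
# Rough parameter count for a one-block transformer language model.
# All dimensions are illustrative assumptions, not the paper's config.

def transformer_params(vocab_size: int, d_model: int, d_ff: int, n_blocks: int) -> int:
    embed = vocab_size * d_model           # token embedding (output head often tied)
    per_block = (
        4 * d_model * d_model              # Q, K, V, and attention output projections
        + 2 * d_model * d_ff               # feed-forward up- and down-projections
    )
    return embed + n_blocks * per_block

# With a limited vocabulary of simple words, one block fits easily under 10M.
total = transformer_params(vocab_size=10_000, d_model=256, d_ff=1024, n_blocks=1)
print(total)  # a few million parameters, well below 10 million
```

The dominant cost at this scale is the embedding table itself, which is why restricting the vocabulary to words a young child understands directly shrinks the model.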
We also introduce a new paradigm for the evaluation of language models: We
suggest a framework which uses GPT-4 to grade the content generated by these
models as if it were stories written by students and graded by a (human)
teacher. This new paradigm overcomes the flaws of standard benchmarks, which
often require the model's output to be very structured; moreover, it provides
a multidimensional score for the model, with separate scores for capabilities
such as grammar, creativity, and consistency.
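One way to picture this teacher-style evaluation is as a rubric prompt handed to GPT-4 along with the model's story. The rubric wording and score axes below are assumptions for illustration, not the paper's actual prompt, and the call to GPT-4 itself is omitted.

```python
# Illustrative sketch of a teacher-style grading prompt; the rubric text
# is an assumption, not the paper's exact prompt. The GPT-4 API call that
# would consume this prompt is intentionally left out.

def build_grading_prompt(story: str) -> str:
    rubric = (
        "You are a teacher grading a short story written by a student.\n"
        "Give a score from 1 to 10 for each of: grammar, creativity, "
        "and consistency, then briefly justify each score."
    )
    return f"{rubric}\n\nStory:\n{story}"

prompt = build_grading_prompt("Once upon a time, a little dog found a red ball.")
```

Because the grader returns a score per axis rather than a single pass/fail judgment on structured output, the same prompt works for free-form generations of any length.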
We hope that TinyStories can facilitate the development, analysis and
research of LMs, especially for low-resource or specialized domains, and shed
light on the emergence of language capabilities in LMs.