TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Abstract

Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters, such as GPT-Neo (small) or GPT-2 (small), can rarely generate coherent and consistent English text beyond a few words, even after extensive training. This raises the question of whether the ability to produce coherent English text emerges only at larger scales (with hundreds of millions of parameters or more) and with complex architectures (with many layers of global attention). In this work, we introduce TinyStories, a synthetic dataset of short stories, generated by GPT-3.5 and GPT-4, that contain only words that typical 3- to 4-year-olds usually understand. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories of several paragraphs that are diverse, have almost perfect grammar, and demonstrate reasoning capabilities. We also introduce a new paradigm for the evaluation of language models: we suggest a framework that uses GPT-4 to grade the content generated by these models as if it were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks, which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, with separate scores for capabilities such as grammar, creativity and consistency. We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.
