Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Abstract

Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we show their value both for text analysis and for improving neural LLMs. Doing so requires modernizing n-gram models in two respects. First, we train them at the same data scale as neural LLMs: 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing n-gram models use a small n, which limits their performance; we instead allow n to be arbitrarily large by introducing a new ∞-gram LM with backoff. Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram, powered by suffix arrays, that can compute ∞-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. The ∞-gram framework and the infini-gram engine enable many novel analyses of human-written and machine-generated text. We find that the ∞-gram LM has fairly high accuracy for next-token prediction (47%) and can complement neural LLMs to greatly reduce their language modeling perplexity. When analyzing machine-generated text, we also observe irregularities in the machine–∞-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and in the positional embeddings of Transformers. We open-source our infini-gram engine in the hope of enabling more study on how best to use verbatim information retrieved from large text corpora.
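To make the mechanism concrete, the following is a minimal toy sketch of how a suffix array can answer n-gram count queries without a pre-computed count table, and how an ∞-gram probability with backoff can be derived from those counts. It is not the infini-gram engine or its API: the function names (`build_suffix_array`, `count`, `infgram_prob`) are hypothetical, the suffix array is built by naive sorting, and the corpus is a nine-word list rather than trillions of tokens. The probability is assumed to be the count of the context followed by the token, divided by the count of the context, using the longest context suffix that occurs in the corpus.

```python
def build_suffix_array(tokens):
    """Return suffix start positions sorted by the suffix they begin.

    Naive construction for illustration only; a real engine would use a
    far more efficient algorithm and on-disk storage.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, pattern, upper):
    """Binary search over the suffix array for the edge of the pattern range."""
    k = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + k]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count(tokens, sa, pattern):
    """Count occurrences of `pattern` (a list of tokens) in the corpus."""
    if not pattern:
        return len(tokens)  # the empty n-gram matches at every position
    first = _bound(tokens, sa, pattern, upper=False)
    last = _bound(tokens, sa, pattern, upper=True)
    return last - first


def infgram_prob(tokens, sa, context, token):
    """Back off to the longest context suffix with a nonzero count.

    Returns (probability, length of the context suffix actually used).
    Assumes a non-empty corpus, so the empty suffix always has count > 0.
    """
    for start in range(len(context) + 1):
        suffix = context[start:]
        denom = count(tokens, sa, suffix)
        if denom > 0:
            return count(tokens, sa, suffix + [token]) / denom, len(suffix)
    raise ValueError("corpus is empty")


corpus = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(corpus)

# "dog the" never occurs, so the query backs off to the context "the".
prob, used = infgram_prob(corpus, sa, ["dog", "the"], "cat")
print(prob, used)
```

In this toy corpus, "the" occurs three times and "the cat" twice, so the backed-off estimate for "cat" is 2/3 with a context suffix of length 1. Each count costs two binary searches over the suffix array, which is why no n-gram table needs to be materialized for any particular n.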


