Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

Abstract

Language models (LMs) have made incredible progress in recent years, largely by moving away from specialized models designed for specific tasks toward general models based on powerful architectures (e.g. the Transformer) that learn everything from raw data. Despite this, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism, which automatically learns content- and context-dependent segmentation strategies jointly with the rest of the model. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end. When compute- and data-matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a strong Transformer language model operating over BPE tokens. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction, demonstrating significantly better scaling with data and matching a token-based Transformer of twice its size. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision. Finally, the H-Net's improvement over tokenized pipelines is further increased in languages and modalities with weaker tokenization heuristics, such as Chinese and code, or DNA sequences (nearly 4x improvement in data efficiency over baselines), showing the potential of true end-to-end models that learn and scale better from unprocessed data.
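To make the idea of content-dependent segmentation concrete, the following is a minimal sketch of how a byte sequence could be split into variable-length chunks. The abstract does not describe the mechanism itself, so everything here is an assumption for illustration: the boundary rule (a boundary is placed where adjacent position vectors are dissimilar under cosine similarity), the 0.5 threshold, and the hypothetical helpers `toy_embed`, `boundary_probs`, and `chunk`. In a real model the vectors would be learned jointly with the rest of the network; here they are hand-written so the example runs on its own.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def boundary_probs(vectors):
    """Assumed rule: p[t] = (1 - cos(v[t], v[t-1])) / 2, with p[0] fixed to 1.0.

    Similar neighbours give a probability near 0, dissimilar ones near 1.
    """
    probs = [1.0]  # the first position always starts a chunk
    for prev, cur in zip(vectors, vectors[1:]):
        probs.append(0.5 * (1.0 - cosine(cur, prev)))
    return probs


def chunk(data, vectors, threshold=0.5):
    """Split `data` wherever the boundary probability reaches `threshold`."""
    if not data:
        return []
    probs = boundary_probs(vectors)
    chunks, start = [], 0
    for t in range(1, len(data)):
        if probs[t] >= threshold:
            chunks.append(data[start:t])
            start = t
    chunks.append(data[start:])
    return chunks


def toy_embed(byte):
    """Hypothetical stand-in for a learned representation:
    letters map to one direction, all other bytes to the opposite one."""
    return [1.0, 0.0] if chr(byte).isalpha() else [-1.0, 0.0]


if __name__ == "__main__":
    text = b"the cat"
    print(chunk(text, [toy_embed(b) for b in text]))
```

With these hand-written vectors the rule reduces to splitting at transitions between letters and non-letters, which is exactly the kind of fixed heuristic the paper aims to replace. The point of the sketch is only the shape of the computation: per-position vectors in, boundary decisions out, chunks of varying length as the result.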
