Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
July 10, 2025
Authors: Sukjun Hwang, Brandon Wang, Albert Gu
cs.AI
Abstract
Despite incredible progress in language models (LMs) in recent years, largely
resulting from moving away from specialized models designed for specific tasks
to general models based on powerful architectures (e.g. the Transformer) that
learn everything from raw data, pre-processing steps such as tokenization
remain a barrier to true end-to-end foundation models. We introduce a
collection of new techniques that enable a dynamic chunking mechanism which
automatically learns content- and context-dependent segmentation strategies
jointly with the rest of the model. Incorporating this into
an explicit hierarchical network (H-Net) allows replacing the (implicitly
hierarchical) tokenization-LM-detokenization pipeline with a single model
learned fully end-to-end. When compute- and data-matched, an H-Net with one
stage of hierarchy operating at the byte level outperforms a strong Transformer
language model operating over BPE tokens. Iterating the hierarchy to multiple
stages further increases its performance by modeling multiple levels of
abstraction, demonstrating significantly better scaling with data and matching
a token-based Transformer of twice its size. H-Nets pretrained on English show
significantly increased character-level robustness, and qualitatively learn
meaningful data-dependent chunking strategies without any heuristics or
explicit supervision. Finally, the H-Net's improvement over tokenized pipelines
is further increased in languages and modalities with weaker tokenization
heuristics, such as Chinese and code, or DNA sequences (nearly 4x improvement
in data efficiency over baselines), showing the potential of true end-to-end
models that learn and scale better from unprocessed data.
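As a rough illustration of what "dynamic chunking" refers to, the sketch below pools byte-level hidden states into variable-length chunks using a learned boundary scorer. This is only a toy approximation under assumed names (BoundaryScorer, dynamic_chunk) and a hard threshold; it is not the H-Net routing and smoothing mechanism itself, which learns boundaries jointly and differentiably with the rest of the model as described in the paper.

```python
# Toy sketch of content-dependent chunking over byte-level hidden states.
# NOT the paper's implementation; module names and hard thresholding are
# illustrative assumptions only.
import torch
import torch.nn as nn


class BoundaryScorer(nn.Module):
    """Scores how likely each position starts a new chunk."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) byte-level hidden states
        # returns boundary probabilities in (0, 1), shape (batch, seq_len)
        return torch.sigmoid(self.proj(x)).squeeze(-1)


def dynamic_chunk(x: torch.Tensor, p: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Mean-pool each run of positions between predicted boundaries.

    x: (seq_len, d_model) hidden states for one sequence
    p: (seq_len,) boundary probabilities from BoundaryScorer
    Returns a (num_chunks, d_model) tensor of coarser chunk embeddings.
    """
    boundaries = (p > threshold).nonzero(as_tuple=True)[0].tolist()
    if not boundaries or boundaries[0] != 0:
        boundaries = [0] + boundaries          # every sequence starts a chunk
    boundaries.append(x.shape[0])              # sentinel end position
    chunks = [x[s:e].mean(dim=0)
              for s, e in zip(boundaries[:-1], boundaries[1:]) if e > s]
    return torch.stack(chunks)


if __name__ == "__main__":
    torch.manual_seed(0)
    d_model, seq_len = 64, 32
    scorer = BoundaryScorer(d_model)
    h = torch.randn(1, seq_len, d_model)       # stand-in for byte-level encoder output
    probs = scorer(h)
    chunked = dynamic_chunk(h[0], probs[0])
    print(h.shape, "->", chunked.shape)        # (1, 32, 64) -> (num_chunks, 64)
```

In the sketch, the shorter chunked sequence would be passed to a higher-level model and the procedure could be nested to add further stages of hierarchy, which is the intuition behind the multi-stage H-Nets evaluated in the abstract.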