Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

July 10, 2025
Authors: Sukjun Hwang, Brandon Wang, Albert Gu
cs.AI

Abstract

Despite incredible progress in language models (LMs) in recent years, largely resulting from moving away from specialized models designed for specific tasks to general models based on powerful architectures (e.g., the Transformer) that learn everything from raw data, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism which automatically learns content- and context-dependent segmentation strategies jointly with the rest of the model. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end. When compute- and data-matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a strong Transformer language model operating over BPE tokens. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction, demonstrating significantly better scaling with data and matching a token-based Transformer of twice its size. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision. Finally, the H-Net's improvement over tokenized pipelines is further increased in languages and modalities with weaker tokenization heuristics, such as Chinese and code, or DNA sequences (nearly 4x improvement in data efficiency over baselines), showing the potential of true end-to-end models that learn and scale better from unprocessed data.
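
To make the chunking idea concrete, the sketch below shows one way a learned boundary predictor could segment a byte-level sequence into variable-length chunks for a coarser stage to process. This is a minimal illustration in PyTorch, not the paper's H-Net: the sigmoid boundary head, the 0.5 threshold, and the mean-pooling of bytes into chunk vectors are all assumptions made for brevity, and a real end-to-end model would need a differentiable (e.g., straight-through or smoothed) version of the hard boundary decision.

```python
import torch
import torch.nn as nn

class DynamicChunker(nn.Module):
    """Toy dynamic-chunking layer: predicts a boundary probability for every
    byte-level position and pools each resulting span into one chunk vector.
    (Hypothetical simplification, not the paper's actual routing module.)"""

    def __init__(self, d_model: int):
        super().__init__()
        self.boundary_head = nn.Linear(d_model, 1)  # per-position boundary score

    def forward(self, h: torch.Tensor):
        # h: (seq_len, d_model) byte-level hidden states (batch size 1 for clarity)
        p = torch.sigmoid(self.boundary_head(h)).squeeze(-1)  # boundary prob per position
        is_boundary = p > 0.5  # hard decision; training would use a relaxed variant

        chunks, start = [], 0
        for t in range(h.size(0)):
            if is_boundary[t] or t == h.size(0) - 1:
                # mean-pool the bytes of this span into a single chunk representation
                chunks.append(h[start:t + 1].mean(dim=0))
                start = t + 1
        return torch.stack(chunks), p  # (num_chunks, d_model), (seq_len,)


if __name__ == "__main__":
    torch.manual_seed(0)
    h = torch.randn(12, 64)            # 12 byte positions, model width 64
    chunker = DynamicChunker(64)
    chunk_vecs, probs = chunker(h)
    print(chunk_vecs.shape)            # (num_chunks, 64); count depends on the (untrained) head
```

In the hierarchical design the abstract describes, the pooled chunk vectors would feed a larger main network operating at the coarser level, and a corresponding dechunking step would map its outputs back to byte positions, so the whole pipeline can be trained end-to-end without a separate tokenizer.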