From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
June 17, 2025
Authors: Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz
cs.AI
Abstract
Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future -- anticipating the next few words rather than the next byte -- so deeper stages focus on broader semantic patterns while earlier stages handle fine details. When carefully tuning and controlling pretraining compute, shallow hierarchies tie strong BPE baselines, and deeper hierarchies have a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.
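To make the multi-scale idea concrete, below is a minimal, hypothetical PyTorch sketch of the contracting (pooling) path described in the abstract: byte representations are pooled at word starts (detected here with a naive space-byte heuristic), and word vectors are pooled again at every second word to approximate the pairs-of-words stage. All names, sizes, and the boundary rule are illustrative assumptions, not the authors' implementation, and the expanding/prediction side of the U-Net is omitted.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the contracting path of an autoregressive U-Net
# over bytes: bytes -> words -> pairs of words. Boundary detection, module
# names, and dimensions are illustrative assumptions only.

DIM = 64

def word_starts(byte_ids, space_id=32):
    """Boolean mask of positions that start a new word: position 0 and
    every position immediately after an ASCII space byte."""
    starts = torch.zeros_like(byte_ids, dtype=torch.bool)
    starts[0] = True
    starts[1:] = byte_ids[:-1] == space_id
    return starts

def causal_encoder(dim=DIM, layers=2, heads=4):
    layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
    return nn.TransformerEncoder(layer, layers)

def causal_mask(t):
    # True above the diagonal = masked, so each position only attends to the past.
    return torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)

# Stage 0: raw bytes, embedded and contextualized at byte granularity.
text = b"the cat sat on the mat"
byte_ids = torch.tensor(list(text))                        # (T,)
byte_emb = nn.Embedding(256, DIM)(byte_ids)[None]           # (1, T, D)
h_bytes = causal_encoder()(byte_emb, mask=causal_mask(byte_ids.numel()))

# Stage 1: pool the byte stream at word-start positions -> one vector per word.
starts = word_starts(byte_ids)
h_words = h_bytes[:, starts]                                 # (1, n_words, D)
h_words = causal_encoder()(h_words, mask=causal_mask(h_words.size(1)))

# Stage 2: pool again at every second word -> one vector per pair of words.
# Deeper stages operate on a coarser sequence, so their next-step prediction
# corresponds to looking several words ahead in the original byte stream.
pair_idx = torch.arange(0, h_words.size(1), 2)
h_pairs = h_words[:, pair_idx]                               # (1, n_pairs, D)
h_pairs = causal_encoder()(h_pairs, mask=causal_mask(h_pairs.size(1)))

print(h_bytes.shape, h_words.shape, h_pairs.shape)
```

In a full model, each pooled level would also be expanded back toward the byte level (the U-Net's upsampling path) so that fine-grained predictions can use the coarse, longer-horizon context; the sketch above only shows how progressively coarser sequences arise from raw bytes.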