From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
June 17, 2025
Authors: Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz
cs.AI
Abstract
Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future -- anticipating the next few words rather than the next byte -- so deeper stages focus on broader semantic patterns while earlier stages handle fine details. When carefully tuning and controlling pretraining compute, shallow hierarchies tie strong BPE baselines, and deeper hierarchies have a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.
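To make the multi-scale idea concrete, below is a minimal, hypothetical PyTorch sketch of the contracting (pooling) path described in the abstract: byte representations are pooled at word starts (detected here with a naive space-byte heuristic), and word vectors are pooled again at every second word to approximate the pairs-of-words stage. All names, sizes, and the boundary rule are illustrative assumptions, not the authors' implementation, and the expanding/prediction side of the U-Net is omitted.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the contracting path of an autoregressive U-Net
# over bytes: bytes -> words -> pairs of words. Boundary detection, module
# names, and dimensions are illustrative assumptions only.

DIM = 64

def word_starts(byte_ids, space_id=32):
    """Boolean mask of positions that start a new word: position 0 and
    every position immediately after an ASCII space byte."""
    starts = torch.zeros_like(byte_ids, dtype=torch.bool)
    starts[0] = True
    starts[1:] = byte_ids[:-1] == space_id
    return starts

def causal_encoder(dim=DIM, layers=2, heads=4):
    layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
    return nn.TransformerEncoder(layer, layers)

def causal_mask(t):
    # True above the diagonal = masked, so each position only attends to the past.
    return torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)

# Stage 0: raw bytes, embedded and contextualized at byte granularity.
text = b"the cat sat on the mat"
byte_ids = torch.tensor(list(text))                        # (T,)
byte_emb = nn.Embedding(256, DIM)(byte_ids)[None]           # (1, T, D)
h_bytes = causal_encoder()(byte_emb, mask=causal_mask(byte_ids.numel()))

# Stage 1: pool the byte stream at word-start positions -> one vector per word.
starts = word_starts(byte_ids)
h_words = h_bytes[:, starts]                                 # (1, n_words, D)
h_words = causal_encoder()(h_words, mask=causal_mask(h_words.size(1)))

# Stage 2: pool again at every second word -> one vector per pair of words.
# Deeper stages operate on a coarser sequence, so their next-step prediction
# corresponds to looking several words ahead in the original byte stream.
pair_idx = torch.arange(0, h_words.size(1), 2)
h_pairs = h_words[:, pair_idx]                               # (1, n_pairs, D)
h_pairs = causal_encoder()(h_pairs, mask=causal_mask(h_pairs.size(1)))

print(h_bytes.shape, h_words.shape, h_pairs.shape)
```

In a full model, each pooled level would also be expanded back toward the byte level (the U-Net's upsampling path) so that fine-grained predictions can use the coarse, longer-horizon context; the sketch above only shows how progressively coarser sequences arise from raw bytes.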