From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
June 17, 2025
Authors: Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz
cs.AI
Abstract
Tokenization imposes a fixed granularity on the input text, freezing how a
language model operates on data and how far in the future it predicts. Byte
Pair Encoding (BPE) and similar schemes split text once, build a static
vocabulary, and leave the model stuck with that choice. We relax this rigidity
by introducing an autoregressive U-Net that learns to embed its own tokens as
it trains. The network reads raw bytes, pools them into words, then pairs of
words, then up to 4 words, giving it a multi-scale view of the sequence. At
deeper stages, the model must predict further into the future -- anticipating
the next few words rather than the next byte -- so deeper stages focus on
broader semantic patterns while earlier stages handle fine details. When
carefully tuning and controlling pretraining compute, shallow hierarchies tie
strong BPE baselines, and deeper hierarchies have a promising trend. Because
tokenization now lives inside the model, the same system can handle
character-level tasks and carry knowledge across low-resource languages.
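The abstract describes pooling raw bytes into words and then coarser groups, with shallow stages predicting the next byte and deeper stages anticipating several words ahead. The sketch below is a minimal, hypothetical PyTorch illustration of that pooling idea; the stage modules, whitespace-based word boundaries, and the look-ahead prediction head are assumptions made for illustration, not the paper's AU-Net implementation.

```python
# Minimal conceptual sketch of the multi-scale pooling idea from the abstract.
# NOT the authors' implementation: class/function names, whitespace word
# boundaries, and the look-ahead head are illustrative assumptions only.
import torch
import torch.nn as nn


class Stage(nn.Module):
    """A small causal Transformer block standing in for one resolution stage."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Causal mask: each position may only attend to earlier positions.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.encoder(x, mask=causal)


def pool_at_boundaries(x: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    """Keep only hidden states at boundary positions (e.g. the byte ending a
    word), producing a shorter, coarser sequence for the next stage."""
    return x[:, boundaries[0]]  # batch size 1 for simplicity


dim = 64
byte_embedding = nn.Embedding(256, dim)   # raw bytes; no fixed BPE vocabulary
byte_stage, word_stage = Stage(dim), Stage(dim)
next_byte_head = nn.Linear(dim, 256)      # shallow stage target: the next byte
lookahead_head = nn.Linear(dim, dim)      # deeper stage target: a representation
                                          # of the next few words (placeholder)

text = b"from bytes to ideas"
byte_ids = torch.tensor([list(text)])                        # (1, seq) byte ids
word_ends = torch.tensor([[ch == ord(" ") for ch in text]])  # crude word boundaries
word_ends[0, -1] = True                                      # close the final word

h_bytes = byte_stage(byte_embedding(byte_ids))                # fine-grained, byte level
h_words = word_stage(pool_at_boundaries(h_bytes, word_ends))  # coarser, word level

byte_logits = next_byte_head(h_bytes)   # fine details handled by the shallow stage
word_preds = lookahead_head(h_words)    # broader semantics handled by the deeper stage
print(byte_logits.shape, word_preds.shape)
```

The point of the sketch is the pooling step: boundary positions decide where the byte-level sequence is coarsened into a word-level one, which is how tokenization can live inside the model rather than in a fixed external vocabulary.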