From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
June 17, 2025
Authors: Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz
cs.AI
Abstract
Tokenization imposes a fixed granularity on the input text, freezing how a
language model operates on data and how far in the future it predicts. Byte
Pair Encoding (BPE) and similar schemes split text once, build a static
vocabulary, and leave the model stuck with that choice. We relax this rigidity
by introducing an autoregressive U-Net that learns to embed its own tokens as
it trains. The network reads raw bytes, pools them into words, then pairs of
words, then up to 4 words, giving it a multi-scale view of the sequence. At
deeper stages, the model must predict further into the future -- anticipating
the next few words rather than the next byte -- so deeper stages focus on
broader semantic patterns while earlier stages handle fine details. When
carefully tuning and controlling pretraining compute, shallow hierarchies tie
strong BPE baselines, and deeper hierarchies have a promising trend. Because
tokenization now lives inside the model, the same system can handle
character-level tasks and carry knowledge across low-resource languages.
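The abstract describes pooling raw bytes into words and then coarser groups, with shallow stages predicting the next byte and deeper stages anticipating several words ahead. The sketch below is a minimal, hypothetical PyTorch illustration of that pooling idea; the stage modules, whitespace-based word boundaries, and the look-ahead prediction head are assumptions made for illustration, not the paper's AU-Net implementation.

```python
# Minimal conceptual sketch of the multi-scale pooling idea from the abstract.
# NOT the authors' implementation: class/function names, whitespace word
# boundaries, and the look-ahead head are illustrative assumptions only.
import torch
import torch.nn as nn


class Stage(nn.Module):
    """A small causal Transformer block standing in for one resolution stage."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Causal mask: each position may only attend to earlier positions.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.encoder(x, mask=causal)


def pool_at_boundaries(x: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    """Keep only hidden states at boundary positions (e.g. the byte ending a
    word), producing a shorter, coarser sequence for the next stage."""
    return x[:, boundaries[0]]  # batch size 1 for simplicity


dim = 64
byte_embedding = nn.Embedding(256, dim)   # raw bytes; no fixed BPE vocabulary
byte_stage, word_stage = Stage(dim), Stage(dim)
next_byte_head = nn.Linear(dim, 256)      # shallow stage target: the next byte
lookahead_head = nn.Linear(dim, dim)      # deeper stage target: a representation
                                          # of the next few words (placeholder)

text = b"from bytes to ideas"
byte_ids = torch.tensor([list(text)])                        # (1, seq) byte ids
word_ends = torch.tensor([[ch == ord(" ") for ch in text]])  # crude word boundaries
word_ends[0, -1] = True                                      # close the final word

h_bytes = byte_stage(byte_embedding(byte_ids))                # fine-grained, byte level
h_words = word_stage(pool_at_boundaries(h_bytes, word_ends))  # coarser, word level

byte_logits = next_byte_head(h_bytes)   # fine details handled by the shallow stage
word_preds = lookahead_head(h_words)    # broader semantics handled by the deeper stage
print(byte_logits.shape, word_preds.shape)
```

The point of the sketch is the pooling step: boundary positions decide where the byte-level sequence is coarsened into a word-level one, which is how tokenization can live inside the model rather than in a fixed external vocabulary.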