ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
March 3, 2026
Authors: Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard, Besnik Fetahu, Nasser Zalmout, Xian Li
cs.AI
Abstract
Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce ByteFlow Net, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries while preserving a static computation graph via Top-K selection. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive and information-grounded language models.
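The Top-K boundary selection the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scoring function here is a stand-in (ByteFlow Net derives its scores from the coding rate of latent representations), and the function name `topk_boundaries` is hypothetical. The key idea shown is that choosing a fixed number k of boundaries keeps the computation graph static, while the positions of those boundaries adapt to the input.

```python
import numpy as np

def topk_boundaries(scores: np.ndarray, k: int) -> np.ndarray:
    """Pick the k byte positions with the highest boundary scores.

    A fixed k keeps the downstream computation graph static: every
    input yields exactly k segment boundaries, but *where* they fall
    adapts to the input (here, to the per-byte scores).
    """
    idx = np.argpartition(scores, -k)[-k:]  # indices of the k largest scores
    return np.sort(idx)                     # return boundaries in stream order

# Toy example: a hypothetical per-byte score over a 10-byte input.
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.1, 0.1, 0.7, 0.2, 0.1, 0.6])
print(topk_boundaries(scores, k=3))  # prints [1 3 6]
```

With a different input the same k = 3 boundaries would land at different positions, which is what makes the segmentation adaptive without changing tensor shapes.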