See the Text: From Tokenization to Visual Reading

October 21, 2025
Authors: Ling Xing, Alex Jinpeng Wang, Rui Yan, Hongyu Qu, Zechao Li, Jinhui Tang
cs.AI

Abstract

People see text. Humans read by recognizing words as visual objects, taking in their shapes, layouts, and patterns before connecting them to meaning, which lets them handle typos, distorted fonts, and varied scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing the strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and sensitivity to linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
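To make the visual-text idea concrete, the sketch below renders a string to an image with Pillow and counts the patch budget a ViT-style vision encoder would see. This is an illustrative assumption, not the authors' released pipeline: the font, canvas size, and 16-pixel patch size are all hypothetical choices, and the downstream multimodal LLM call is omitted.

```python
# Minimal sketch of visual-text rendering (illustrative, not SeeTok's code).
from PIL import Image, ImageDraw, ImageFont

def render_visual_text(text: str, width: int = 512, height: int = 64) -> Image.Image:
    """Render plain text onto a white canvas, as a multimodal LLM's
    vision encoder would receive it."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in a real .ttf to cover other scripts
    draw.text((8, 8), text, fill="black", font=font)
    return img

# A ViT-style encoder splits the image into fixed-size patches, so the
# visual "token" count depends only on geometry, not on vocabulary.
PATCH = 16  # hypothetical patch size
img = render_visual_text("People see text. Humans read words as visual objects.")
visual_tokens = (img.width // PATCH) * (img.height // PATCH)
print(f"{visual_tokens} visual patches for the rendered line")
```

The design point this illustrates: the visual token budget scales with rendered area rather than with the string's script or a tokenizer's vocabulary, which is why scripts that subword tokenizers over-segment stand to gain the most.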