See the Text: From Tokenization to Visual Reading
October 21, 2025
Authors: Ling Xing, Alex Jinpeng Wang, Rui Yan, Hongyu Qu, Zechao Li, Jinhui Tang
cs.AI
Abstract
People see text. Humans read by recognizing words as visual objects,
including their shapes, layouts, and patterns, before connecting them to
meaning, which enables us to handle typos, distorted fonts, and various scripts
effectively. Modern large language models (LLMs), however, rely on subword
tokenization, fragmenting text into pieces from a fixed vocabulary. While
effective for high-resource languages, this approach over-segments low-resource
languages, yielding long, linguistically meaningless sequences and inflating
computation. In this work, we challenge this entrenched paradigm and move
toward a vision-centric alternative. Our method, SeeTok, renders text as images
(visual-text) and leverages pretrained multimodal LLMs to interpret them,
reusing strong OCR and text-vision alignment abilities learned from large-scale
multimodal training. Across three different language tasks, SeeTok matches or
surpasses subword tokenizers while requiring 4.43 times fewer tokens and
reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization,
robustness to typographic noise, and linguistic hierarchy. SeeTok signals a
shift from symbolic tokenization to human-like visual reading, and takes a step
toward more natural and cognitively inspired language models.
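
To make the visual-text idea concrete, here is a minimal sketch (not the authors' implementation; the function name, canvas size, and default font are illustrative assumptions): the input string is rendered onto an image with Pillow, and that image, rather than a sequence of subword token IDs, is what a pretrained multimodal LLM would be asked to read.

# Minimal illustration of rendering text as an image ("visual-text").
# Assumed helper, not SeeTok's actual pipeline.
from PIL import Image, ImageDraw, ImageFont

def render_text_as_image(text: str, width: int = 512, height: int = 64) -> Image.Image:
    """Draw `text` on a white canvas; the image serves as the model's visual input."""
    canvas = Image.new("RGB", (width, height), color="white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()  # swap in a real TTF for non-Latin scripts
    draw.text((4, 4), text, fill="black", font=font)
    return canvas

if __name__ == "__main__":
    img = render_text_as_image("Humans read words as visual objects.")
    img.save("visual_text.png")  # this image, not subword IDs, is passed to the multimodal LLM

In this reading of the method, the vision encoder's OCR and text-vision alignment abilities, learned during large-scale multimodal pretraining, replace the subword tokenizer, which is how the reported token and FLOP reductions become possible.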