

Text Rendering Strategies for Pixel Language Models

November 1, 2023
Authors: Jonas F. Lotz, Elizabeth Salesky, Phillip Rust, Desmond Elliott
cs.AI

Abstract

Pixel-based language models process text rendered as images, which allows them to handle any script, making them a promising approach to open vocabulary language modelling. However, recent approaches use text renderers that produce a large set of almost-equivalent input patches, which may prove sub-optimal for downstream tasks, due to redundancy in the input representations. In this paper, we investigate four approaches to rendering text in the PIXEL model (Rust et al., 2023), and find that simple character bigram rendering brings improved performance on sentence-level tasks without compromising performance on token-level or multilingual tasks. This new rendering strategy also makes it possible to train a more compact model with only 22M parameters that performs on par with the original 86M parameter model. Our analyses show that character bigram rendering leads to a consistently better model but with an anisotropic patch embedding space, driven by a patch frequency bias, highlighting the connections between image patch- and tokenization-based language models.
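To make the rendering strategies concrete, below is a minimal sketch of character-bigram rendering, assuming a 16x16-pixel patch size (as used by PIXEL) and Pillow's default font. The function and variable names are hypothetical illustrations, not taken from the authors' code: the point is that drawing one bigram per patch aligns patch boundaries with character boundaries, avoiding the many near-duplicate patches that continuous rendering produces.

```python
from PIL import Image, ImageDraw, ImageFont

PATCH_SIZE = 16  # PIXEL renders text into fixed 16x16-pixel patches


def render_bigrams(text: str) -> Image.Image:
    """Render each character bigram into its own fixed-size patch.

    With continuous rendering, a character pair can straddle two patches,
    yielding a large set of almost-equivalent patches; rendering exactly
    one bigram per patch removes that source of redundancy.
    """
    bigrams = [text[i:i + 2] for i in range(0, len(text), 2)]
    canvas = Image.new("L", (PATCH_SIZE * len(bigrams), PATCH_SIZE), color=255)
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    for idx, bigram in enumerate(bigrams):
        # Draw each bigram at the left edge of its own patch, so every
        # patch boundary coincides with a bigram boundary.
        draw.text((idx * PATCH_SIZE, 2), bigram, fill=0, font=font)
    return canvas


patches = render_bigrams("Pixel language models")
patches.save("bigram_patches.png")
```

In this sketch the resulting image is simply sliced into consecutive 16x16 patches before being fed to the encoder; a production renderer would also need to handle font fallback for arbitrary scripts, which is what makes pixel-based models open-vocabulary in the first place.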