Text Rendering Strategies for Pixel Language Models
November 1, 2023
Authors: Jonas F. Lotz, Elizabeth Salesky, Phillip Rust, Desmond Elliott
cs.AI
Abstract
Pixel-based language models process text rendered as images, which allows
them to handle any script, making them a promising approach to open vocabulary
language modelling. However, recent approaches use text renderers that produce
a large set of almost-equivalent input patches, which may prove sub-optimal for
downstream tasks, due to redundancy in the input representations. In this
paper, we investigate four approaches to rendering text in the PIXEL model
(Rust et al., 2023), and find that simple character bigram rendering brings
improved performance on sentence-level tasks without compromising performance
on token-level or multilingual tasks. This new rendering strategy also makes it
possible to train a more compact model with only 22M parameters that performs
on par with the original 86M parameter model. Our analyses show that character
bigram rendering leads to a consistently better model but with an anisotropic
patch embedding space, driven by a patch frequency bias, highlighting the
connections between image patch- and tokenization-based language models.
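The intuition behind character-bigram rendering is that assigning exactly two characters to each image patch bounds the set of possible patches by the set of character bigrams, instead of letting patch content vary with arbitrary horizontal offsets as in continuous rendering. A minimal sketch of this segmentation step (the function name is illustrative and ours; the actual PIXEL renderer rasterizes each bigram into a fixed-size pixel patch):

```python
def char_bigrams(text: str) -> list[str]:
    """Split text into non-overlapping character bigrams, one per patch.

    This sketches only the segmentation idea: each patch is tied to at
    most two characters, so nearly-equivalent patches produced by
    different horizontal offsets cannot occur. The real renderer then
    draws each bigram's glyphs into a fixed-size image patch.
    """
    return [text[i:i + 2] for i in range(0, len(text), 2)]


print(char_bigrams("pixel"))  # odd-length input leaves a final 1-char patch
print(char_bigrams("text"))
```

Because the patch inventory is now (at most) the set of observed bigrams, the model's input distribution starts to resemble that of a tokenization-based model, which is consistent with the patch frequency bias the abstract describes.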