Text Rendering Strategies for Pixel Language Models
November 1, 2023
Authors: Jonas F. Lotz, Elizabeth Salesky, Phillip Rust, Desmond Elliott
cs.AI
Abstract
Pixel-based language models process text rendered as images, which allows
them to handle any script, making them a promising approach to open vocabulary
language modelling. However, recent approaches use text renderers that produce
a large set of almost-equivalent input patches, which may prove sub-optimal for
downstream tasks, due to redundancy in the input representations. In this
paper, we investigate four approaches to rendering text in the PIXEL model
(Rust et al., 2023), and find that simple character bigram rendering brings
improved performance on sentence-level tasks without compromising performance
on token-level or multilingual tasks. This new rendering strategy also makes it
possible to train a more compact model with only 22M parameters that performs
on par with the original 86M parameter model. Our analyses show that character
bigram rendering leads to a consistently better model but with an anisotropic
patch embedding space, driven by a patch frequency bias, highlighting the
connections between image patch- and tokenization-based language models.
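The intuition behind character-bigram rendering is that assigning exactly two characters to each image patch bounds the set of possible patches by the set of character bigrams, instead of letting patch content vary with arbitrary horizontal offsets as in continuous rendering. A minimal sketch of this segmentation step (the function name is illustrative and ours; the actual PIXEL renderer rasterizes each bigram into a fixed-size pixel patch):

```python
def char_bigrams(text: str) -> list[str]:
    """Split text into non-overlapping character bigrams, one per patch.

    This sketches only the segmentation idea: each patch is tied to at
    most two characters, so nearly-equivalent patches produced by
    different horizontal offsets cannot occur. The real renderer then
    draws each bigram's glyphs into a fixed-size image patch.
    """
    return [text[i:i + 2] for i in range(0, len(text), 2)]


print(char_bigrams("pixel"))  # odd-length input leaves a final 1-char patch
print(char_bigrams("text"))
```

Because the patch inventory is now (at most) the set of observed bigrams, the model's input distribution starts to resemble that of a tokenization-based model, which is consistent with the patch frequency bias the abstract describes.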