Text Rendering Strategies for Pixel Language Models

Abstract

Pixel-based language models process text rendered as images, which allows them to handle any script and makes them a promising approach to open-vocabulary language modelling. However, recent approaches use text renderers that produce a large set of almost-equivalent input patches, which may prove sub-optimal for downstream tasks because of the redundancy in the input representations. In this paper, we investigate four approaches to rendering text in the PIXEL model (Rust et al., 2023) and find that simple character bigram rendering improves performance on sentence-level tasks without compromising performance on token-level or multilingual tasks. This new rendering strategy also makes it possible to train a more compact model with only 22M parameters that performs on par with the original 86M parameter model. Our analyses show that character bigram rendering leads to a consistently better model, but one with an anisotropic patch embedding space driven by a patch frequency bias, which highlights the connections between image patch-based and tokenization-based language models.
