JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

Abstract

Recent work in image and video generation has adopted the autoregressive LLM architecture because of its generality and potentially easy integration into multi-modal systems. The crux of applying autoregressive training from language generation to visual generation is discretization: representing continuous data such as images and videos as discrete tokens. Common discretization methods either model raw pixel values, which yields prohibitively long sequences, or use vector quantization, which requires convoluted pre-hoc training. In this work, we propose to directly model images and videos as compressed files saved on computers via canonical codecs (e.g., JPEG, AVC/H.264). Using the default Llama architecture without any vision-specific modifications, we pretrain JPEG-LM from scratch to generate images (and AVC-LM to generate videos as a proof of concept) by directly outputting compressed file bytes in JPEG and AVC formats. Evaluation of image generation shows that this simple and straightforward approach is more effective than pixel-based modeling and sophisticated vector quantization baselines (on which our method yields a 31% reduction in FID). Our analysis shows that JPEG-LM has a particular advantage over vector quantization models in generating long-tail visual elements. Overall, we show that using canonical codec representations can help lower the barriers between language generation and visual generation, facilitating future research on multi-modal language/image/video LLMs.
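To make the discretization idea concrete, the following is a minimal sketch of treating a compressed file as a token sequence. It is an illustration under stated assumptions, not the tokenizer used for JPEG-LM: the one-token-per-byte mapping, the special-token ids `BOS` and `EOS`, the helper names, and the 256x256 image size are all hypothetical choices made for this example. Only the JPEG start-of-image and end-of-image markers are real JPEG syntax; the payload bytes are placeholders.

```python
# Hypothetical byte-level tokenization of a compressed file (an assumption for
# illustration, not the paper's actual tokenizer or vocabulary).
BOS, EOS = 256, 257  # assumed special-token ids placed after the 256 byte values
SOI, EOI = b"\xff\xd8", b"\xff\xd9"  # JPEG start-of-image / end-of-image markers


def bytes_to_tokens(data: bytes) -> list[int]:
    """Map each byte of a compressed file to one token id, framed by BOS/EOS."""
    return [BOS, *data, EOS]


def tokens_to_bytes(tokens: list[int]) -> bytes:
    """Invert bytes_to_tokens by dropping the special tokens."""
    return bytes(t for t in tokens if t < 256)


def raw_pixel_length(width: int, height: int, channels: int = 3) -> int:
    """Sequence length if every channel value of every pixel were one token."""
    return width * height * channels


# Stand-in payload: the three middle bytes are placeholders, not valid JPEG data.
fake_jpeg = SOI + bytes([0x12, 0x34, 0x56]) + EOI
tokens = bytes_to_tokens(fake_jpeg)

# The mapping is lossless, so generated tokens can be written back as a file.
restored = tokens_to_bytes(tokens)

# For comparison, an illustrative 256x256 RGB image modeled pixel by pixel.
pixel_tokens = raw_pixel_length(256, 256)
```

Because the mapping is invertible, a model that emits such tokens produces a file that a standard decoder can open directly, with no learned image tokenizer in between. The pixel count in the last line shows why raw-pixel modeling leads to very long sequences.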
