JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
August 15, 2024
作者: Xiaochuang Han, Marjan Ghazvininejad, Pang Wei Koh, Yulia Tsvetkov
cs.AI
Abstract
Recent work in image and video generation has been adopting the
autoregressive LLM architecture due to its generality and potentially easy
integration into multi-modal systems. The crux of applying autoregressive
training in language generation to visual generation is discretization --
representing continuous data like images and videos as discrete tokens. Common
methods of discretizing images and videos include modeling raw pixel values,
which are prohibitively lengthy, or vector quantization, which requires
convoluted pre-hoc training. In this work, we propose to directly model images
and videos as compressed files saved on computers via canonical codecs (e.g.,
JPEG, AVC/H.264). Using the default Llama architecture without any
vision-specific modifications, we pretrain JPEG-LM from scratch to generate
images (and AVC-LM to generate videos as a proof of concept), by directly
outputting compressed file bytes in JPEG and AVC formats. Evaluation of image
generation shows that this simple and straightforward approach is more
effective than pixel-based modeling and sophisticated vector quantization
baselines (on which our method yields a 31% reduction in FID). Our analysis
shows that JPEG-LM has an especial advantage over vector quantization models in
generating long-tail visual elements. Overall, we show that using canonical
codec representations can help lower the barriers between language generation
and visual generation, facilitating future research on multi-modal
language/image/video LLMs.
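The core idea of treating a compressed file as a token sequence can be illustrated with a minimal sketch. The byte-to-token mapping below (one token id per file byte, vocabulary of 256) is an illustrative assumption; the abstract does not specify JPEG-LM's exact tokenization scheme, and the `bytes_to_tokens`/`tokens_to_bytes` helpers are hypothetical names.

```python
# Sketch: representing a JPEG file as a sequence of discrete tokens that an
# autoregressive LM could be trained on. One token per raw byte (ids 0-255);
# this mapping is an illustrative assumption, not the paper's exact scheme.

def bytes_to_tokens(data: bytes) -> list[int]:
    """Map each file byte to a token id in [0, 255]."""
    return list(data)

def tokens_to_bytes(tokens: list[int]) -> bytes:
    """Invert the mapping: token ids back to raw file bytes."""
    return bytes(tokens)

# A tiny stand-in for real JPEG contents: the SOI (0xFFD8) and EOI (0xFFD9)
# markers that begin and end every JPEG file, with placeholder payload bytes.
jpeg_like = bytes([0xFF, 0xD8]) + b"...entropy-coded data..." + bytes([0xFF, 0xD9])

tokens = bytes_to_tokens(jpeg_like)
assert tokens_to_bytes(tokens) == jpeg_like  # the mapping is lossless
print(tokens[:2])  # the first two tokens encode the SOI marker
```

Because the mapping is a lossless round trip, a model that generates a valid token sequence directly yields a decodable compressed file, with no separate vector-quantization decoder needed.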