JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

August 15, 2024
Authors: Xiaochuang Han, Marjan Ghazvininejad, Pang Wei Koh, Yulia Tsvetkov
cs.AI

Abstract

Recent work in image and video generation has been adopting the autoregressive LLM architecture due to its generality and potentially easy integration into multi-modal systems. The crux of applying autoregressive training in language generation to visual generation is discretization -- representing continuous data like images and videos as discrete tokens. Common methods of discretizing images and videos include modeling raw pixel values, which are prohibitively lengthy, or vector quantization, which requires convoluted pre-hoc training. In this work, we propose to directly model images and videos as compressed files saved on computers via canonical codecs (e.g., JPEG, AVC/H.264). Using the default Llama architecture without any vision-specific modifications, we pretrain JPEG-LM from scratch to generate images (and AVC-LM to generate videos as a proof of concept), by directly outputting compressed file bytes in JPEG and AVC formats. Evaluation of image generation shows that this simple and straightforward approach is more effective than pixel-based modeling and sophisticated vector quantization baselines (over which our method yields a 31% reduction in FID). Our analysis shows that JPEG-LM has a particular advantage over vector quantization models in generating long-tail visual elements. Overall, we show that using canonical codec representations can help lower the barriers between language generation and visual generation, facilitating future research on multi-modal language/image/video LLMs.
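
To make the core idea concrete, here is a minimal sketch (not the authors' released code) of what "modeling images as compressed file bytes" means in practice: an image is serialized through a standard JPEG encoder, and the resulting byte stream is treated as the discrete token sequence for an autoregressive LM. The JPEG quality setting and the simple one-byte-per-token vocabulary below are illustrative assumptions; the paper's exact tokenization and preprocessing may differ.

```python
# Sketch: images <-> discrete token sequences via a canonical codec (JPEG).
# Assumes Pillow is installed. Token vocabulary here is simply the 256
# possible byte values; quality=25 is an assumed setting, not the paper's.
import io
from PIL import Image

def image_to_tokens(path: str, quality: int = 25) -> list[int]:
    """Encode an image as JPEG and return its bytes as token ids (0-255)."""
    buf = io.BytesIO()
    Image.open(path).convert("RGB").save(buf, format="JPEG", quality=quality)
    return list(buf.getvalue())  # each file byte becomes one discrete token

def tokens_to_image(tokens: list[int]) -> Image.Image:
    """Decode a generated token sequence back into an image with the codec."""
    return Image.open(io.BytesIO(bytes(tokens)))

# Usage:
#   tokens = image_to_tokens("example.png")  # a few thousand tokens per image
#   img = tokens_to_image(tokens)            # any standard JPEG decoder works
```

Because the codec, not a learned tokenizer, handles compression, no vision-specific pre-hoc training is needed: a default byte-level Llama-style LM can be trained directly on such sequences.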

