JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

August 15, 2024
Authors: Xiaochuang Han, Marjan Ghazvininejad, Pang Wei Koh, Yulia Tsvetkov
cs.AI

Abstract

Recent work in image and video generation has been adopting the autoregressive LLM architecture due to its generality and potentially easy integration into multi-modal systems. The crux of applying autoregressive training in language generation to visual generation is discretization -- representing continuous data like images and videos as discrete tokens. Common methods of discretizing images and videos include modeling raw pixel values, which are prohibitively lengthy, or vector quantization, which requires convoluted pre-hoc training. In this work, we propose to directly model images and videos as compressed files saved on computers via canonical codecs (e.g., JPEG, AVC/H.264). Using the default Llama architecture without any vision-specific modifications, we pretrain JPEG-LM from scratch to generate images (and AVC-LM to generate videos as a proof of concept), by directly outputting compressed file bytes in JPEG and AVC formats. Evaluation of image generation shows that this simple and straightforward approach is more effective than pixel-based modeling and sophisticated vector quantization baselines (on which our method yields a 31% reduction in FID). Our analysis shows that JPEG-LM has a particular advantage over vector quantization models in generating long-tail visual elements. Overall, we show that using canonical codec representations can help lower the barriers between language generation and visual generation, facilitating future research on multi-modal language/image/video LLMs.
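To make the core idea concrete, below is a minimal sketch of the codec-as-tokenizer step the abstract describes: an image is serialized into canonical JPEG bytes, and each byte is treated as one discrete token for an off-the-shelf autoregressive LM. This is an illustrative assumption based only on the abstract, not the authors' released pipeline; the helper names, the quality setting, and the byte-level vocabulary layout (256 ids plus any special tokens) are hypothetical choices for the sketch.

```python
# Sketch (not the authors' code): discretize images via a canonical JPEG
# codec by using raw compressed-file bytes as LM tokens.
import io

from PIL import Image


def image_to_jpeg_tokens(image: Image.Image, quality: int = 25) -> list[int]:
    """Compress an image with a standard JPEG codec and return its raw
    file bytes as integer tokens in [0, 255]."""
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    return list(buf.getvalue())  # each file byte becomes one discrete token


def jpeg_tokens_to_image(tokens: list[int]) -> Image.Image:
    """Decode a generated byte sequence back into an image with the same
    canonical codec; malformed sequences will fail to decode."""
    return Image.open(io.BytesIO(bytes(tokens)))
```

Under this scheme, a vanilla Llama-style causal LM with a 256-entry byte vocabulary (plus special tokens) can be pretrained from scratch with the ordinary next-token objective, with no vision-specific architecture changes, which is the property the abstract emphasizes.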
