Heptapod: Language Modeling on Visual Signals

Abstract

We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs causal attention, eliminates reliance on classifier-free guidance (CFG), and eschews the trend toward semantic tokenizers. Our key innovation is next 2D distribution prediction: a causal Transformer equipped with a reconstruction-focused visual tokenizer learns to predict the distribution over the entire 2D spatial grid of the image at each timestep. This learning objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics through generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of 2.70, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.
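To make the contrast with ordinary next-token prediction concrete, the following is a minimal NumPy sketch of what a "next 2D distribution" training loss could look like. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how the prediction head is parameterized or which grid positions are supervised at each step, so the function name `next_2d_distribution_loss`, the `(T, H, W, V)` logit layout, and the choice to supervise every grid cell at every timestep are all hypothetical.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def next_2d_distribution_loss(grid_logits, target_tokens):
    """Mean cross-entropy over every timestep and every grid cell.

    grid_logits:   (T, H, W, V) array. ASSUMPTION: at each causal timestep t
                   a head emits a categorical distribution over V visual
                   tokens for each of the H x W grid cells.
    target_tokens: (H, W) integer token ids from the visual tokenizer.
    """
    T, H, W, V = grid_logits.shape
    logp = log_softmax(grid_logits, axis=-1)
    # Every timestep is scored against the same full-grid target
    # (ASSUMPTION: no masking of already-generated positions).
    idx = np.broadcast_to(target_tokens, (T, H, W))[..., None]
    picked = np.take_along_axis(logp, idx, axis=-1)[..., 0]
    return float(-picked.mean())


# Usage with random stand-in data (shapes are illustrative only).
rng = np.random.default_rng(0)
T, H, W, V = 4, 2, 2, 16
logits = rng.normal(size=(T, H, W, V))
targets = rng.integers(0, V, size=(H, W))
loss = next_2d_distribution_loss(logits, targets)
```

In a standard next-token objective the logits would have shape `(T, V)` and each step would be scored against a single target token. Here each step is scored against the whole grid, which is the sense in which the objective resembles masked autoencoding while the backbone remains causal.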
