Heptapod: Language Modeling on Visual Signals
October 8, 2025
Authors: Yongxin Zhu, Jiawei Chen, Yuanzhe Chen, Zhuo Chen, Dongya Jia, Jian Cong, Xiaobin Zhuang, Yuping Wang, Yuxuan Wang
cs.AI
Abstract
We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs causal attention, eliminates reliance on classifier-free guidance (CFG), and eschews the trend of semantic tokenizers. Our key innovation is next 2D distribution prediction: a causal Transformer paired with a reconstruction-focused visual tokenizer learns to predict the distribution over the entire 2D spatial grid of the image at each timestep. This learning objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of 2.70, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.
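
To make the learning objective concrete, below is a minimal sketch, assuming a PyTorch setup, of what a next-2D-distribution prediction head could look like. `Next2DDistributionHead`, `next_2d_loss`, the tensor shapes, and the uniform per-timestep supervision are all illustrative assumptions inferred from the abstract, not the authors' implementation.

```python
# A minimal sketch, NOT the authors' code: at each timestep, the causal
# Transformer's hidden state is projected to a categorical distribution
# over EVERY cell of the 2D token grid, so the sequential objective also
# carries a holistic, masked-autoencoding-like signal. Sizes are kept
# small for the demo; real settings would be much larger.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Next2DDistributionHead(nn.Module):
    def __init__(self, d_model=256, grid_h=8, grid_w=8, vocab_size=512):
        super().__init__()
        self.grid = grid_h * grid_w
        self.vocab_size = vocab_size
        # One logit vector per spatial position of the full 2D grid.
        self.proj = nn.Linear(d_model, self.grid * vocab_size)

    def forward(self, h):
        # h: (B, T, d_model) hidden states from a causal Transformer.
        B, T, _ = h.shape
        # (B, T, grid, vocab): at timestep t, a distribution over the
        # whole H x W grid, not just the single next token.
        return self.proj(h).view(B, T, self.grid, self.vocab_size)

def next_2d_loss(logits, tokens):
    # tokens: (B, grid) ground-truth ids from a reconstruction-focused
    # visual tokenizer. Here every timestep is supervised on all grid
    # cells, an assumption based only on the abstract's description.
    B, T, G, V = logits.shape
    target = tokens.unsqueeze(1).expand(B, T, G)   # (B, T, grid)
    return F.cross_entropy(logits.reshape(-1, V), target.reshape(-1))

# Tiny smoke test with random data.
head = Next2DDistributionHead()
h = torch.randn(2, 8, 256)                  # stand-in Transformer states
tokens = torch.randint(0, 512, (2, 64))     # one 8x8 token grid per image
loss = next_2d_loss(head(h), tokens)
print(loss.item())                          # ~ln(512) = 6.24 at init
```

Predicting the full grid at every step is what distinguishes this objective from standard next-token prediction, where the head would emit a single vocabulary distribution per timestep.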