Heptapod: Language Modeling on Visual Signals
October 8, 2025
Authors: Yongxin Zhu, Jiawei Chen, Yuanzhe Chen, Zhuo Chen, Dongya Jia, Jian Cong, Xiaobin Zhuang, Yuping Wang, Yuxuan Wang
cs.AI
Abstract
We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs causal attention, eliminates reliance on classifier-free guidance (CFG), and eschews the trend of semantic tokenizers. Our key innovation is next 2D distribution prediction: a causal Transformer, paired with a reconstruction-focused visual tokenizer, learns to predict the distribution over the entire 2D spatial grid of the image at each timestep. This learning objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of 2.70, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.
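To make the "next 2D distribution prediction" objective concrete, below is a minimal sketch of how such a training loss could look. It assumes a VQ-style reconstruction tokenizer that yields discrete codes over an H×W grid; the names `Next2DDistributionHead` and `next_2d_loss`, and all shapes, are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Next2DDistributionHead(nn.Module):
    """Hypothetical head: maps each causal Transformer hidden state to
    logits over the entire 2D token grid (grid_size^2 positions x vocab)."""

    def __init__(self, d_model: int, grid_size: int, vocab_size: int):
        super().__init__()
        self.grid_size = grid_size
        self.vocab_size = vocab_size
        self.proj = nn.Linear(d_model, grid_size * grid_size * vocab_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model) hidden states from a causal Transformer.
        B, T, _ = h.shape
        logits = self.proj(h)  # (B, T, grid^2 * vocab)
        return logits.view(B, T, self.grid_size * self.grid_size, self.vocab_size)

def next_2d_loss(logits: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the whole grid at every timestep.

    logits:        (B, T, N, V) -- a full-grid distribution per causal prefix.
    target_tokens: (B, N) discrete codes from a reconstruction tokenizer.
    Each prefix is supervised against the same grid of ground-truth codes,
    so the model must predict the entire image distribution at each step.
    """
    B, T, N, V = logits.shape
    targets = target_tokens.unsqueeze(1).expand(B, T, N)  # broadcast over timesteps
    return F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
```

Under this reading, the objective resembles masked autoencoding in that every step scores the full spatial grid, while the causal prefix structure preserves standard autoregressive sequence modeling; the exact factorization, masking of already-observed positions, and head architecture in Heptapod may differ.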