Heptapod: Language Modeling on Visual Signals

Abstract

We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs causal attention, eliminates reliance on classifier-free guidance (CFG), and eschews the trend toward semantic tokenizers. Our key innovation is next 2D distribution prediction: a causal Transformer with a reconstruction-focused visual tokenizer learns to predict the distribution over the entire 2D spatial grid of the image at each timestep. This learning objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves a Fréchet Inception Distance (FID) of 2.70, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.
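To make the shape of this objective concrete, the following is a minimal NumPy sketch of what "predicting a distribution over the entire 2D grid at each timestep" could look like under causal attention. It is an illustration only, not the Heptapod architecture: the function names (`causal_mask`, `causal_self_attention`, `next_2d_distribution`), the single-head attention without learned projections, the discrete vocabulary, and the single linear head `w_head` are all assumptions introduced here, and the abstract does not specify the prediction head or the loss.

```python
import numpy as np


def causal_mask(num_steps):
    """Boolean mask: entry (i, j) is True when step i may attend to step j (j <= i)."""
    return np.tril(np.ones((num_steps, num_steps), dtype=bool))


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def causal_self_attention(x, mask):
    """Single-head attention over x of shape (T, D).

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # hide future steps
    return softmax(scores, axis=-1) @ x


def next_2d_distribution(hidden, w_head, grid_hw, vocab_size):
    """Map each step's hidden state to a categorical distribution per grid cell.

    hidden: (T, D) causal hidden states
    w_head: (D, H * W * vocab_size) hypothetical linear prediction head
    Returns probabilities of shape (T, H, W, vocab_size).
    """
    height, width = grid_hw
    logits = hidden @ w_head
    logits = logits.reshape(hidden.shape[0], height, width, vocab_size)
    return softmax(logits, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W, V, D = 2, 2, 5, 8          # toy grid, vocabulary, and width
    T = H * W                         # one token per grid cell
    tokens = rng.normal(size=(T, D))  # stand-in for embedded visual tokens
    head = rng.normal(size=(D, H * W * V))
    hidden = causal_self_attention(tokens, causal_mask(T))
    probs = next_2d_distribution(hidden, head, (H, W), V)
    print(probs.shape)
```

In this sketch every timestep emits a full H x W map of categorical distributions, while the causal mask guarantees that the map produced at step t depends only on tokens up to t. Which grid positions contribute to the training loss is left out deliberately, since the abstract does not state it.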
