FARMER: Flow AutoRegressive Transformer over Pixels
October 27, 2025
Authors: Guangting Zheng, Qinyu Zhao, Tao Yang, Fei Xiao, Zhijie Lin, Jie Wu, Jiajun Deng, Yanyong Zhang, Rui Zhu
cs.AI
Abstract
Directly modeling the explicit likelihood of the raw data distribution is a key
topic in machine learning, and it has achieved scaling success in
Large Language Models through autoregressive modeling. However, continuous AR
modeling over visual pixel data suffers from extremely long sequences and
high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end
generative framework that unifies Normalizing Flows (NF) and Autoregressive
(AR) models for tractable likelihood estimation and high-quality image
synthesis directly from raw pixels. FARMER employs an invertible autoregressive
flow to transform images into latent sequences, whose distribution is modeled
implicitly by an autoregressive model. To address the redundancy and complexity
in pixel-level modeling, we propose a self-supervised dimension reduction
scheme that partitions NF latent channels into informative and redundant
groups, enabling more effective and efficient AR modeling. Furthermore, we
design a one-step distillation scheme to significantly accelerate inference
and introduce a resampling-based classifier-free guidance algorithm to
boost image generation quality. Extensive experiments demonstrate that FARMER
achieves competitive performance compared to existing pixel-based generative
models while providing exact likelihoods and scalable training.
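
To make the "exact likelihoods" claim concrete, here is a minimal sketch of how an invertible autoregressive flow combined with an autoregressive model over the latents gives a tractable log-likelihood through the change-of-variables formula, log p(x) = log p(z) + log|det dz/dx|. It is a toy illustration, not FARMER's architecture: the conditioner `causal_params`, the Gaussian AR prior with coefficient `rho`, and all function names are assumptions made for this example, and the "image" is a short 1-D sequence.

```python
import numpy as np

def causal_params(prefix):
    """Toy conditioner (assumed): shift and log-scale from strictly earlier values."""
    if prefix.size == 0:
        return 0.0, 0.0
    m = float(prefix.mean())
    return 0.5 * m, 0.1 * float(np.tanh(m))

def flow_forward(x):
    """Map x -> z with an affine autoregressive flow; also return log|det dz/dx|."""
    x = np.asarray(x, dtype=float)
    z = np.empty_like(x)
    log_det = 0.0
    for i in range(x.size):
        shift, log_scale = causal_params(x[:i])
        z[i] = (x[i] - shift) * np.exp(-log_scale)
        log_det -= log_scale  # Jacobian is triangular, so log-det is a sum
    return z, log_det

def flow_inverse(z):
    """Map z -> x by inverting the flow one position at a time."""
    z = np.asarray(z, dtype=float)
    x = np.empty_like(z)
    for i in range(z.size):
        shift, log_scale = causal_params(x[:i])  # only already-filled entries
        x[i] = z[i] * np.exp(log_scale) + shift
    return x

def ar_prior_logp(z, rho=0.3):
    """Toy AR model over latents (assumed): z_i | z_{i-1} ~ N(rho * z_{i-1}, 1)."""
    prev = np.concatenate(([0.0], z[:-1]))
    resid = z - rho * prev
    return float(np.sum(-0.5 * resid**2 - 0.5 * np.log(2.0 * np.pi)))

def log_likelihood(x):
    """Exact log p(x) = log p_AR(z) + log|det dz/dx| with z = flow_forward(x)."""
    z, log_det = flow_forward(x)
    return ar_prior_logp(z) + log_det

x = np.array([0.2, -1.0, 0.7, 1.5])
z, _ = flow_forward(x)
print("round-trip ok:", np.allclose(flow_inverse(z), x))
print("log p(x) =", log_likelihood(x))
```

Because each output depends only on earlier inputs, the Jacobian is triangular and its log-determinant reduces to a sum of per-position log-scales, which is what keeps the likelihood exact and cheap to evaluate. The sequential loop in `flow_inverse` also shows why sampling from such a flow is slow, which is the kind of cost an inference-time distillation scheme would target.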