FARMER: Flow AutoRegressive Transformer over Pixels

Abstract

Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning, and autoregressive modeling has brought this approach scaling success in Large Language Models. However, continuous autoregressive (AR) modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and AR models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity of pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme that significantly accelerates inference, and we introduce a resampling-based classifier-free guidance algorithm to improve image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared with existing pixel-based generative models while providing exact likelihoods and scalable training.
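As a rough illustration of how an invertible autoregressive flow paired with an autoregressive prior yields an exact likelihood, the toy sketch below applies the change-of-variables rule log p(x) = log p_AR(z) + log|det dz/dx| to a short one-dimensional sequence. It is not FARMER's architecture: the function names, the affine conditioner with parameters `w` and `b`, and the AR(1) Gaussian prior with coefficient `a` are all assumptions chosen for illustration.

```python
import math

def flow_forward(x, w=0.5, b=0.1):
    """Toy affine autoregressive flow: z_i = (x_i - mu_i) * exp(-s_i).

    mu_i and s_i depend only on x_{i-1} (a hypothetical conditioner),
    so the Jacobian is triangular and log|det dz/dx| = -sum(s_i).
    """
    z, log_det, prev = [], 0.0, 0.0
    for xi in x:
        mu = w * prev
        s = b * math.tanh(prev)
        z.append((xi - mu) * math.exp(-s))
        log_det -= s
        prev = xi
    return z, log_det

def flow_inverse(z, w=0.5, b=0.1):
    """Invert the flow sequentially, recovering one element at a time."""
    x, prev = [], 0.0
    for zi in z:
        mu = w * prev
        s = b * math.tanh(prev)
        xi = zi * math.exp(s) + mu
        x.append(xi)
        prev = xi
    return x

def ar_log_prob(z, a=0.3):
    """Toy AR(1) Gaussian prior (assumed): z_i | z_{i-1} ~ N(a * z_{i-1}, 1)."""
    total, prev = 0.0, 0.0
    for zi in z:
        resid = zi - a * prev
        total += -0.5 * resid ** 2 - 0.5 * math.log(2.0 * math.pi)
        prev = zi
    return total

def log_likelihood(x):
    """Exact log p(x) = log p_AR(z) + log|det dz/dx|."""
    z, log_det = flow_forward(x)
    return ar_log_prob(z) + log_det
```

Calling `flow_inverse(flow_forward(x)[0])` recovers `x` up to floating-point error, which is the property that makes the likelihood exact rather than a bound. Sampling would draw `z` from the prior and run the inverse pass.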
