VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
December 22, 2025
Authors: Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li, Wei Wei, Angela Yao
cs.AI
Abstract
Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment means generated token sequences may decode into low-quality images, since the generator receives no direct supervision from the pixel space. We propose VA-π, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-π formulates generator-tokenizer alignment as a variational optimization problem, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize over the discrete token space, VA-π introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward measures how well the predicted token sequences reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO acts as a natural constraint, maintaining the distributional consistency of the tokens. VA-π enables rapid adaptation of existing AR generators, requiring neither tokenizer retraining nor external reward models. With only 1% of ImageNet-1K and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains on the GenEval text-to-image benchmark for both a visual generation model (LlamaGen: from 0.306 to 0.339) and a unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at https://github.com/Lil-Shake/VA-Pi.
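The abstract does not state the bound explicitly, but a plausible form, assuming the tokenizer encoder plays the role of a variational posterior q(z|x) over token sequences z and the AR generator parameterizes the prior p_θ(z), is:

```latex
\log p(x) \;\ge\;
\underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{pixel reconstruction}}
\;-\;
\underbrace{\mathrm{KL}\big(q(z \mid x)\,\|\,p_\theta(z)\big)}_{\text{token-distribution regularizer}}
```

The first term rewards decoding the predicted tokens back into the original pixels, while the KL term keeps the generator consistent with the tokenizer's token distribution; the paper's exact derivation may differ.

Below is a minimal, self-contained sketch of the training step the abstract describes: sample per-position token predictions under teacher forcing, decode them to pixels for an intrinsic reconstruction reward, and apply a REINFORCE-style update with a token-likelihood term standing in for the ELBO's regularizer. All names (ToyTokenizer, ToyARGenerator), shapes, and the 0.1 weight are illustrative assumptions, not the released VA-Pi implementation.

```python
# Hedged sketch of VA-pi-style post-training: teacher-forced predictions,
# pixel-space reconstruction reward, REINFORCE update with a likelihood
# regularizer. Toy modules stand in for a real VQ tokenizer and AR generator.
import torch
import torch.nn.functional as F

class ToyTokenizer(torch.nn.Module):
    """Stand-in VQ tokenizer: codebook embedding plus a linear pixel decoder."""
    def __init__(self, vocab=16, seq_len=8, pixels=32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, 8)
        self.decode_head = torch.nn.Linear(seq_len * 8, pixels)

    def decode(self, tokens):            # tokens: (B, L) -> images: (B, pixels)
        z = self.embed(tokens).flatten(1)
        return self.decode_head(z)

class ToyARGenerator(torch.nn.Module):
    """Stand-in AR generator: next-token logits from teacher-forced prefixes."""
    def __init__(self, vocab=16, dim=8):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab + 1, dim)  # +1 for a BOS token
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, prefix):           # prefix: (B, L) -> logits: (B, L, vocab)
        return self.head(self.embed(prefix))

vocab, seq_len, pixels, B = 16, 8, 32, 4
tok, gen = ToyTokenizer(vocab, seq_len, pixels), ToyARGenerator(vocab)
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)  # tokenizer stays frozen

images = torch.randn(B, pixels)                    # pretend ground-truth images
gt_tokens = torch.randint(0, vocab, (B, seq_len))  # pretend tokenizer encodings

# Teacher forcing: condition every position on the ground-truth prefix
# (BOS + gt[:-1]), so no expensive free-running sampling is needed.
bos = torch.full((B, 1), vocab)
prefix = torch.cat([bos, gt_tokens[:, :-1]], dim=1)
logits = gen(prefix)                               # (B, L, vocab)
dist = torch.distributions.Categorical(logits=logits)
pred_tokens = dist.sample()                        # one sampled token per position

# Intrinsic reward: how well the predicted tokens decode back to the image.
with torch.no_grad():
    recon = tok.decode(pred_tokens)
    reward = -F.mse_loss(recon, images, reduction="none").mean(dim=1)  # (B,)
    reward = reward - reward.mean()                # mean baseline, lower variance

# REINFORCE on the reconstruction reward, plus a token-NLL term as a simple
# stand-in for the ELBO's regularizer that keeps the policy close to the
# tokenizer's token distribution.
logp = dist.log_prob(pred_tokens).sum(dim=1)       # (B,)
pg_loss = -(reward * logp).mean()
nll = F.cross_entropy(logits.reshape(-1, vocab), gt_tokens.reshape(-1))
loss = pg_loss + 0.1 * nll                         # 0.1: illustrative weight

opt.zero_grad()
loss.backward()
opt.step()
```

Because rewards are computed from one teacher-forced pass per batch, each update costs roughly one forward pass of the generator plus one decode of the frozen tokenizer, which is consistent with the abstract's claim of minutes-scale adaptation.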