MIMFlow: End-to-End Image Generation by Unifying Masked Image Modeling and Normalizing Flows

Abstract

Normalizing Flows (NFs) are powerful generative models capable of exact density estimation and sampling. However, their strict invertibility often forces the model to spend its capacity on low-level pixel details, hindering the capture of high-level semantic structure. Masked Image Modeling (MIM) has excelled in representation learning, but its integration into generative pipelines has remained largely modular and disjointed. In this paper, we propose MIMFlow, a unified end-to-end framework that jointly optimizes latent semantics, pixel reconstruction, and the generative flow. By employing a VAE encoder to infer semantic latents from masked images, MIMFlow achieves a principled decoupling of the generative task: the normalizing flow focuses on modeling a simplified, low-frequency semantic manifold, while a specialized decoder handles high-frequency synthesis. This design effectively resolves the inherent capacity bottleneck of NFs, allowing the model to prioritize global structural coherence over redundant noise. Empirical results on ImageNet 256×256 show that MIMFlow-L reaches 71.3% linear probing accuracy and an FID of 2.50. Despite using only 128 tokens (50% fewer than standard models), it yields a 32.8% performance gain over similar-scale NF baselines. Our code is available at https://github.com/MCG-NJU/MIMFlow.
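
As background for the "exact density estimation" property mentioned above, the standard change-of-variables identity for an invertible map can be written as follows. The symbols f (the invertible flow), p_Z (the base density), and z are generic notation assumed here for illustration. They are not taken from MIMFlow itself, and this is not a statement of the paper's training objective.

```latex
% Generic change-of-variables identity for a normalizing flow.
% f, p_Z, and z are assumed illustrative symbols, not the paper's notation.
\[
\log p_X(x) = \log p_Z\bigl(f(x)\bigr)
  + \log\left|\det \frac{\partial f(x)}{\partial x}\right|,
\qquad z = f(x).
\]
```

Because f must be invertible, it preserves dimensionality and has to account for every input dimension, which is the constraint behind the capacity concern raised in the abstract. In a setup like the one described, x would presumably be the encoder's semantic latent rather than raw pixels.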