Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Abstract

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational cost severely hinders practical deployment. Building a highly optimized task-specific specialist is a promising alternative; however, extreme structural compression inevitably causes a severe representation bottleneck. To address this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-λ Mix Interaction (LλMI) block. Comprising Local-λ and Interactive-λ modules, the block summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically reducing the parameter count. Furthermore, to unlock the full representational capacity of this highly compact architecture, we pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments on natural-image and portrait benchmarks demonstrate that this combination enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this with less than 2% of the parameters (0.22B vs. 11.9B) while delivering a more than 15× speedup in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page: https://hustvl.github.io/Moebius
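The abstract does not give the equations of the LλMI block. As a rough illustration only of what "summarizing a context into a fixed-size linear matrix" can mean, the sketch below follows the generic content-lambda formulation from lambda layers, which is assumed here and may differ from what Moebius actually does. The function names (`softmax_over_positions`, `summarize_context`, `apply_summary`) and the toy dimensions are hypothetical.

```python
import math

# Illustrative sketch, NOT the Moebius implementation: a generic
# lambda-layer-style content summary. All names here are hypothetical.

def softmax_over_positions(keys):
    """Normalize each key channel across the context positions."""
    m, k = len(keys), len(keys[0])
    out = [[0.0] * k for _ in range(m)]
    for j in range(k):
        col = [keys[i][j] for i in range(m)]
        peak = max(col)  # subtract the max for numerical stability
        exps = [math.exp(c - peak) for c in col]
        total = sum(exps)
        for i in range(m):
            out[i][j] = exps[i] / total
    return out

def summarize_context(keys, values):
    """Collapse m context positions into one k x v matrix: softmax(K)^T V."""
    weights = softmax_over_positions(keys)
    k, v = len(keys[0]), len(values[0])
    lam = [[0.0] * v for _ in range(k)]
    for w_row, v_row in zip(weights, values):
        for a in range(k):
            for b in range(v):
                lam[a][b] += w_row[a] * v_row[b]
    return lam

def apply_summary(query, lam):
    """Apply the summary to one query as a linear map: y = lam^T q."""
    return [sum(query[a] * lam[a][b] for a in range(len(query)))
            for b in range(len(lam[0]))]

# The summary stays k x v (here 2 x 3) whether the context has 5 or 500 positions.
context_keys = [[0.1 * i, -0.2 * i] for i in range(5)]
context_values = [[float(i), i + 1.0, i + 2.0] for i in range(5)]
summary = summarize_context(context_keys, context_values)
output = apply_summary([1.0, 0.5], summary)
```

In this formulation the size of the summary depends only on the key and value depths, not on the number of context positions, so each query is processed with a single small matrix product. That property matches the "fixed-size linear matrices" wording in the abstract, though the actual Local-λ and Interactive-λ modules are not specified there.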
