Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Abstract

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Building a highly optimized task-specific specialist is a promising alternative; however, extreme structural compression inevitably creates a severe representation bottleneck. To overcome this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-λ Mix Interaction (LλMI) block. Comprising Local-λ and Interactive-λ modules, the block summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically reducing the parameter count. Furthermore, to unlock the full representational capacity of this highly compact architecture, we pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments on natural and portrait benchmarks demonstrate that this combination enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this with less than 2% of the parameters (0.22B vs. 11.9B) while delivering a >15× speedup in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page: https://hustvl.github.io/Moebius.
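
To make "fixed-size linear matrices" concrete, here is a minimal sketch of summarizing a variable-length context into a matrix whose size does not depend on the number of context tokens. The abstract does not specify the internals of the LλMI block, so this follows the general content-summary idea of lambda-style layers as an assumption and is not the authors' implementation. The function names `summarize_context` and `apply_summary` and all dimensions are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_context(keys, values):
    """Fold n context tokens into a fixed-size (k x v) summary matrix.

    keys:   n rows of length k
    values: n rows of length v
    The result has k rows and v columns regardless of n.
    (Assumed formulation, not taken from the paper.)
    """
    n, k, v = len(keys), len(keys[0]), len(values[0])
    # Normalize each key channel across the n context positions.
    weights = [softmax([keys[m][j] for m in range(n)]) for j in range(k)]
    # summary[j][c] = sum over positions m of weights[j][m] * values[m][c]
    return [
        [sum(weights[j][m] * values[m][c] for m in range(n)) for c in range(v)]
        for j in range(k)
    ]


def apply_summary(queries, summary):
    """Each query (length k) reads the summary through one linear map."""
    k, v = len(summary), len(summary[0])
    return [
        [sum(q[j] * summary[j][c] for j in range(k)) for c in range(v)]
        for q in queries
    ]
```

In this sketch, the cost of reading the context per query depends only on k and v, not on the number of context tokens, which is the property that makes such summaries attractive for a compact backbone.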