Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance
June 17, 2026
Authors: Kangsheng Duan, Ziyang Xu, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang
cs.AI
Abstract
While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To overcome this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-λ Mix Interaction (LλMI) block. Comprising Local-λ and Interactive-λ modules, it summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically reducing the parameter count. Furthermore, to unlock the full representational capacity of this highly compact architecture, we pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this combination enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2% of the parameters (0.22B vs. 11.9B) while delivering a more than 15× speedup in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page: https://hustvl.github.io/Moebius.
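The idea of summarizing a context into a fixed-size linear matrix can be illustrated with a minimal sketch. The abstract does not specify the internals of the Local-λ or Interactive-λ modules, so the code below assumes a generic content-lambda formulation in the style of lambda layers: keys are softmax-normalized across context positions, combined with values into a k × v matrix, and that matrix is then applied to each query as a linear map. All function names, dimensions, and weights here are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_over_positions(keys):
    """Normalize each key channel across context positions (column-wise softmax)."""
    cols = []
    for col in zip(*keys):
        mx = max(col)
        exps = [math.exp(x - mx) for x in col]
        total = sum(exps)
        cols.append([e / total for e in exps])
    return transpose(cols)


def content_lambda(context, w_k, w_v):
    """Summarize an (m, d) context into a (k, v) matrix, independent of m.

    Assumed formulation (not confirmed by the source): lambda = softmax(K)^T V.
    """
    keys = softmax_over_positions(matmul(context, w_k))  # (m, k)
    values = matmul(context, w_v)                         # (m, v)
    return matmul(transpose(keys), values)                # (k, v)


def apply_lambda(queries, lam):
    """Apply the summary as a linear map to each (k,)-dim query: (n, k) -> (n, v)."""
    return matmul(queries, lam)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, k, v = 8, 4, 6  # hypothetical feature, key, and value dimensions
w_k = random_matrix(d, k, rng)
w_v = random_matrix(d, v, rng)

# Contexts of very different lengths produce summaries of the same size.
small = content_lambda(random_matrix(16, d, rng), w_k, w_v)
large = content_lambda(random_matrix(256, d, rng), w_k, w_v)
queries = random_matrix(5, k, rng)
out = apply_lambda(queries, large)

print(len(small), len(small[0]))
print(len(large), len(large[0]))
print(len(out), len(out[0]))
```

In this sketch the summary has k × v entries whether the context holds 16 or 256 positions, so the cost of applying it to a query does not grow with context length. This is one plausible reading of "fixed-size linear matrices" in the abstract, under the stated assumptions.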