**VIBE: Visual Instruction Based Editor**
January 5, 2026
Authors: Grigorii Alekseenko, Aleksandr Gordeev, Irina Tolstykh, Bulat Suleimanov, Vladimir Dokholyan, Georgii Fedorov, Sergey Yakubson, Aleksandra Tsybina, Mikhail Chernyshov, Maksim Kuprashevich
cs.AI
Abstract
Instruction-based image editing is among the fastest-developing areas in generative AI. Over the past year, the field has reached a new level, with dozens of open-source models released alongside highly capable commercial systems. However, only a limited number of open-source approaches currently reach the quality needed for real-world use. In addition, diffusion backbones, the dominant choice for these pipelines, are often too large and computationally expensive for many deployment and research settings, with widely used variants typically containing 6B to 20B parameters. This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a modern 2B-parameter Qwen3-VL model to guide the editing process and the 1.6B-parameter diffusion model Sana1.5 for image generation. Our design decisions across architecture, data processing, training configuration, and evaluation target low-cost inference and strict source consistency while maintaining high quality across the major edit categories feasible at this scale. Evaluated on the ImgEdit and GEdit benchmarks, the proposed method matches or exceeds the performance of substantially heavier baselines, including models with several times as many parameters and higher inference cost, and is particularly strong on edits that require preserving the input image, such as attribute adjustment, object removal, background editing, and targeted replacement. The model fits within 24 GB of GPU memory and generates edited images at up to 2K resolution in approximately 4 seconds on an NVIDIA H100 in BF16, without additional inference optimizations or distillation.
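To make the division of labor concrete, the sketch below shows the two-stage flow the abstract describes: a small vision-language model fuses the source image with the edit instruction into conditioning embeddings, and a compact diffusion model generates the edited image under that conditioning. All class names, dimensions, token counts, and the toy refinement loop are illustrative assumptions for exposition; they are not the paper's implementation or the real Qwen3-VL/Sana1.5 APIs.

```python
import torch
import torch.nn as nn

class InstructionEncoder(nn.Module):
    """Stand-in for the 2B-parameter multimodal guide (Qwen3-VL in the paper)."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Fuse source-image tokens and instruction tokens into a single
        # sequence of conditioning embeddings for the generator.
        return self.proj(torch.cat([image_tokens, text_tokens], dim=1))

class EditGenerator(nn.Module):
    """Stand-in for the 1.6B-parameter diffusion generator (Sana1.5 in the paper)."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.denoiser = nn.Linear(dim, dim)

    @torch.no_grad()
    def forward(self, latents: torch.Tensor, cond: torch.Tensor, steps: int = 20) -> torch.Tensor:
        # Toy iterative refinement standing in for a real diffusion
        # sampling loop: each step nudges the latents toward the edit
        # while the conditioning keeps them tied to the source image.
        for _ in range(steps):
            latents = latents - 0.05 * self.denoiser(latents + cond.mean(dim=1, keepdim=True))
        return latents

# Illustrative shapes only: one source image encoded to 256 visual tokens,
# a 32-token instruction, and 64 latent tokens to refine.
guide, generator = InstructionEncoder(), EditGenerator()
image_tokens = torch.randn(1, 256, 1024)
text_tokens = torch.randn(1, 32, 1024)     # e.g. "remove the car on the left"
latents = torch.randn(1, 64, 1024)

cond = guide(image_tokens, text_tokens)
edited_latents = generator(latents, cond)  # a real pipeline would decode these to pixels
print(edited_latents.shape)                # torch.Size([1, 64, 1024])
```

The appeal of this split, as the abstract frames it, is that both components stay small (2B plus 1.6B parameters), which is what allows the full pipeline to fit in 24 GB of GPU memory without distillation or additional inference optimizations.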