

VIBE: Visual Instruction Based Editor

January 5, 2026
Authors: Grigorii Alekseenko, Aleksandr Gordeev, Irina Tolstykh, Bulat Suleimanov, Vladimir Dokholyan, Georgii Fedorov, Sergey Yakubson, Aleksandra Tsybina, Mikhail Chernyshov, Maksim Kuprashevich
cs.AI

Abstract

Instruction-based image editing is among the fastest-developing areas in generative AI. Over the past year, the field has reached a new level, with dozens of open-source models released alongside highly capable commercial systems. However, only a limited number of open-source approaches currently achieve real-world quality. In addition, diffusion backbones, the dominant choice for these pipelines, are often too large and computationally expensive for many deployment and research settings, with widely used variants typically containing 6B to 20B parameters. This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a modern 2B-parameter Qwen3-VL model to guide the editing process and the 1.6B-parameter diffusion model Sana1.5 for image generation. Our design decisions across architecture, data processing, training configuration, and evaluation target low-cost inference and strict source consistency while maintaining high quality across the major edit categories feasible at this scale. Evaluated on the ImgEdit and GEdit benchmarks, the proposed method matches or exceeds the performance of substantially heavier baselines, including models with several times as many parameters and higher inference cost, and is particularly strong on edits that require preserving the input image, such as attribute adjustment, object removal, background edits, and targeted replacement. The model fits within 24 GB of GPU memory and generates edited images at up to 2K resolution in approximately 4 seconds on an NVIDIA H100 in BF16, without additional inference optimizations or distillation.
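The abstract describes a two-stage design: a small vision-language model (Qwen3-VL, ~2B parameters) interprets the edit instruction and guides a compact diffusion generator (Sana1.5, ~1.6B parameters). The sketch below illustrates only the control flow of such a two-stage pipeline; every class and function name here is a hypothetical stand-in, not the authors' actual API or the real model interfaces.

```python
# Minimal structural sketch of a VLM-guided editing pipeline, as described
# in the abstract. Both stages are stubbed out; in the real system they
# would be a Qwen3-VL forward pass and a Sana1.5 denoising loop.
from dataclasses import dataclass


@dataclass
class EditRequest:
    instruction: str   # e.g. "remove the red car"
    source_image: list # placeholder for pixel data


def vlm_guidance(request: EditRequest) -> dict:
    """Stand-in for the ~2B VLM stage: map the instruction and source
    image to conditioning signals for the generator."""
    edit_type = "object_removal" if "remove" in request.instruction else "other"
    return {"edit_type": edit_type, "text_condition": request.instruction}


def diffusion_edit(image, conditioning: dict) -> dict:
    """Stand-in for the ~1.6B diffusion stage: produce the edited image
    from the source image and the VLM's conditioning."""
    return {"image": image, "applied": conditioning["edit_type"]}


def edit_image(instruction: str, image) -> dict:
    """Run the full two-stage pipeline: guidance first, then generation."""
    request = EditRequest(instruction=instruction, source_image=image)
    conditioning = vlm_guidance(request)
    return diffusion_edit(image, conditioning)
```

The separation mirrors the paper's stated goal: the expensive instruction-understanding work happens once in the small VLM, so the diffusion backbone can stay compact while still receiving edit-specific conditioning.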