PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Abstract

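This paper introduces PixWizard, a versatile image-to-image visual assistant for image generation, manipulation, and translation driven by free-form language instructions. To this end, we unify a variety of vision tasks in a single image-text-to-image generation framework and curate an omni pixel-to-pixel instruction-tuning dataset. By writing detailed instruction templates in natural language, we cover a large and diverse set of vision tasks, including text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, and inpainting/outpainting. We adopt the Diffusion Transformer (DiT) as the foundation model and extend it with a flexible any-resolution mechanism, which lets the model process images dynamically according to the aspect ratio of the input, in close alignment with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to fuse information from the input image effectively. Our experiments show that PixWizard not only exhibits impressive generative and understanding abilities on images of different resolutions, but also generalizes well to unseen tasks and human instructions. The code and related resources are available at https://github.com/AFeng-x/PixWizard.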

English

This paper presents a versatile image-to-image visual assistant, PixWizard, designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we integrate a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we comprehensively include a large set of diverse vision tasks such as text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization to unseen tasks and human instructions. The code and related resources are available at https://github.com/AFeng-x/PixWizard.
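
The abstract does not say how the any-resolution mechanism chooses a working size for a given input. One common way to handle varying aspect ratios in diffusion models is aspect-ratio bucketing, and the sketch below illustrates that idea only; it is not taken from the PixWizard code. The function names, the 512×512 pixel budget, the side multiple of 64, and the maximum aspect ratio of 4 are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def candidate_resolutions(
    pixel_budget: int = 512 * 512,  # assumed area budget, not from the paper
    multiple: int = 64,             # assumed side granularity
    max_ratio: float = 4.0,         # assumed limit on aspect ratio
) -> List[Tuple[int, int]]:
    """List (width, height) pairs whose sides are multiples of `multiple`
    and whose area does not exceed `pixel_budget`."""
    sizes = []
    max_side = pixel_budget // multiple
    for w in range(multiple, max_side + 1, multiple):
        # Largest height (rounded down to the multiple) that fits the budget.
        h = (pixel_budget // w) // multiple * multiple
        if h < multiple:
            continue
        if w / h <= max_ratio and h / w <= max_ratio:
            sizes.append((w, h))
    return sizes


def closest_resolution(
    width: int, height: int, candidates: List[Tuple[int, int]]
) -> Tuple[int, int]:
    """Pick the candidate whose aspect ratio is closest to the input's,
    comparing ratios in log space so that 2:1 and 1:2 are treated symmetrically."""
    target = math.log(width / height)
    return min(candidates, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


if __name__ == "__main__":
    buckets = candidate_resolutions()
    for size in [(1024, 1024), (1920, 1080), (600, 1500)]:
        print(size, "->", closest_resolution(*size, buckets))
```

Under this scheme an input keeps roughly its original aspect ratio while the total number of pixels, and therefore the token count seen by a Transformer backbone, stays bounded.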