MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills
May 9, 2025
Authors: Niladri Shekhar Dutt, Duygu Ceylan, Niloy J. Mitra
cs.AI
Abstract
Retouching is an essential task in post-manipulation of raw photographs.
Generative editing, guided by text or strokes, provides a new tool accessible
to users but can easily change the identity of the original objects in
unacceptable and unpredictable ways. In contrast, although traditional
procedural edits, as commonly supported by photo-editing tools (e.g., GIMP,
Lightroom), are conservative, they are still preferred by professionals.
Unfortunately, professional-quality retouching involves many individual
procedural editing operations that are challenging for most novices to plan. In
this paper, we ask if a multimodal large language model (MLLM) can be taught to
critique raw photographs, suggest suitable remedies, and finally realize them
with a given set of pre-authored procedural image operations. We demonstrate
that MLLMs can be first made aware of the underlying image processing
operations, by training them to solve specially designed visual puzzles.
Subsequently, such an operation-aware MLLM can both plan and propose edit
sequences. To facilitate training, given a set of expert-edited photos, we
synthesize a reasoning dataset by procedurally manipulating the expert edits
and then grounding a pretrained LLM on the resulting visual adjustments to generate
reasoning for finetuning. The proposed retouching operations are, by
construction, understandable by the users, preserve object details and
resolution, and can be optionally overridden. We evaluate our setup on a
variety of test examples and show advantages, in terms of explainability and
identity preservation, over existing generative and other procedural
alternatives. Code, data, models, and supplementary results can be found via
our project website at https://monetgpt.github.io.