MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills
May 9, 2025
Authors: Niladri Shekhar Dutt, Duygu Ceylan, Niloy J. Mitra
cs.AI
Abstract
Retouching is an essential task in post-manipulation of raw photographs.
Generative editing, guided by text or strokes, provides a new tool accessible
to users but can easily change the identity of the original objects in
unacceptable and unpredictable ways. In contrast, although traditional
procedural edits, as commonly supported by photo-editing tools (e.g., GIMP,
Lightroom), are conservative, they are still preferred by professionals.
Unfortunately, professional-quality retouching involves many individual
procedural editing operations that are challenging for most novices to plan. In
this paper, we ask if a multimodal large language model (MLLM) can be taught to
critique raw photographs, suggest suitable remedies, and finally realize them
with a given set of pre-authored procedural image operations. We demonstrate
that MLLMs can be first made aware of the underlying image processing
operations, by training them to solve specially designed visual puzzles.
Subsequently, such an operation-aware MLLM can both plan and propose edit
sequences. To facilitate training, given a set of expert-edited photos, we
synthesize a reasoning dataset for finetuning by procedurally manipulating the
expert edits and then grounding a pretrained LLM on the resulting visual
adjustments. The proposed retouching operations are, by
construction, understandable by the users, preserve object details and
resolution, and can be optionally overridden. We evaluate our setup on a
variety of test examples and show advantages, in terms of explainability and
identity preservation, over existing generative and other procedural
alternatives. Code, data, models, and supplementary results can be found via
our project website at https://monetgpt.github.io.
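
The idea of realizing a proposed edit sequence through pre-authored procedural operations, with optional user overrides, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the operation names (`exposure`, `contrast`), their formulas, the `apply_plan` helper, and the plan format are hypothetical and are not taken from the MonetGPT code. The image is a flat list of grayscale intensities in [0, 1] so that the example needs no external libraries.

```python
from typing import Callable, Dict, List, Optional, Tuple

Image = List[float]  # flat list of grayscale intensities in [0, 1]


def _clamp(value: float) -> float:
    """Keep an intensity inside the valid [0, 1] range."""
    return max(0.0, min(1.0, value))


def adjust_exposure(image: Image, amount: float) -> Image:
    # Hypothetical operation: additive brightness shift.
    return [_clamp(p + amount) for p in image]


def adjust_contrast(image: Image, amount: float) -> Image:
    # Hypothetical operation: scale each pixel's distance from mid-gray by (1 + amount).
    return [_clamp(0.5 + (p - 0.5) * (1.0 + amount)) for p in image]


# Library of pre-authored procedural operations, keyed by name (names are illustrative).
OPERATIONS: Dict[str, Callable[[Image, float], Image]] = {
    "exposure": adjust_exposure,
    "contrast": adjust_contrast,
}


def apply_plan(
    image: Image,
    plan: List[Tuple[str, float]],
    overrides: Optional[Dict[str, float]] = None,
) -> Image:
    """Apply a sequence of (operation, amount) steps in order.

    `overrides` lets a user replace the proposed amount for any named operation.
    """
    overrides = overrides or {}
    result = list(image)  # never mutate the input image
    for name, amount in plan:
        amount = overrides.get(name, amount)
        result = OPERATIONS[name](result, amount)
    return result


# A plan such as a model might propose (values are made up for this example).
raw = [0.10, 0.30, 0.50]
plan = [("exposure", 0.10), ("contrast", 0.50)]

edited = apply_plan(raw, plan)
# The user keeps the exposure change but cancels the contrast change.
overridden = apply_plan(raw, plan, overrides={"contrast": 0.0})
```

Because each step is a named, parametric operation rather than a regenerated image, the plan stays readable, the pixel grid keeps its original size, and a single step can be changed without touching the others.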