JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent
June 21, 2025
Authors: Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding, Wenbo Li, Shuicheng Yan
cs.AI
Abstract
Photo retouching has become integral to contemporary visual storytelling,
enabling users to capture aesthetics and express creativity. While professional
tools such as Adobe Lightroom offer powerful capabilities, they demand
substantial expertise and manual effort. In contrast, existing AI-based
solutions provide automation but often suffer from limited adjustability and
poor generalization, failing to meet diverse and personalized editing needs. To
bridge this gap, we introduce JarvisArt, a multi-modal large language model
(MLLM)-driven agent that understands user intent, mimics the reasoning process
of professional artists, and intelligently coordinates over 200 retouching
tools within Lightroom. JarvisArt undergoes a two-stage training process: an
initial Chain-of-Thought supervised fine-tuning stage to establish basic reasoning
and tool-use skills, followed by Group Relative Policy Optimization for
Retouching (GRPO-R) to further enhance its decision-making and tool
proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate
seamless integration with Lightroom. To evaluate performance, we develop
MMArt-Bench, a novel benchmark constructed from real-world user edits.
JarvisArt demonstrates user-friendly interaction, superior generalization, and
fine-grained control over both global and local adjustments, opening a new
avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a
60% improvement in average pixel-level content-fidelity metrics on MMArt-Bench,
while maintaining comparable instruction-following capabilities.
Project Page: https://jarvisart.vercel.app/.
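
The abstract names Group Relative Policy Optimization as the basis of the second training stage but does not spell out its mechanics. As a rough illustration only, the sketch below shows the group-relative advantage step of generic GRPO: several candidate outputs are sampled for the same input, each gets a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the reward values, and the use of the population standard deviation are assumptions made for this example; the retouching-specific reward design of GRPO-R is not described here and is not reproduced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its own sampling group.

    Generic GRPO-style step, not the paper's GRPO-R reward design.
    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate edit sequences sampled
# for the same photo and instruction.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates that score above their group's mean receive a positive advantage and are reinforced, and those below receive a negative one, so no separate learned value model is needed to provide a baseline.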