JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

Abstract

Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. While professional tools such as Adobe Lightroom offer powerful capabilities, they demand substantial expertise and manual effort. In contrast, existing AI-based solutions provide automation but often suffer from limited adjustability and poor generalization, failing to meet diverse and personalized editing needs. To bridge this gap, we introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt undergoes a two-stage training process: an initial Chain-of-Thought supervised fine-tuning to establish basic reasoning and tool-use skills, followed by Group Relative Policy Optimization for Retouching (GRPO-R) to further enhance its decision-making and tool proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate seamless integration with Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, opening a new avenue for intelligent photo retouching. Notably, on MMArt-Bench it outperforms GPT-4o with a 60% improvement in average pixel-level metrics for content fidelity, while maintaining comparable instruction-following capabilities. Project Page: https://jarvisart.vercel.app/.
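The abstract does not spell out how GRPO-R scores candidate edits, so the following is only a minimal sketch of the group-relative advantage step that Group Relative Policy Optimization is generally built on: several candidate outputs are sampled for one prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the example reward values, and the use of the population standard deviation are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar per candidate sampled for the same prompt.
    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate retouching plans for one request.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates that score above their group's mean receive a positive advantage and are reinforced, while those below the mean are discouraged, so no separate learned value function is needed as a baseline.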
