JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

Abstract

Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. Professional tools such as Adobe Lightroom offer powerful capabilities but demand substantial expertise and manual effort. Existing AI-based solutions automate the process, yet they often suffer from limited adjustability and poor generalization, and so fail to meet diverse, personalized editing needs. To bridge this gap, we introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt is trained in two stages: Chain-of-Thought supervised fine-tuning to establish basic reasoning and tool-use skills, followed by Group Relative Policy Optimization for Retouching (GRPO-R) to further enhance its decision-making and tool proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate seamless integration with Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, opening a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o by 60% on average pixel-level metrics for content fidelity on MMArt-Bench, while maintaining comparable instruction-following capabilities. Project page: https://jarvisart.vercel.app/
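The abstract names GRPO-R without describing its reward design, so the sketch below illustrates only the generic group-relative advantage that Group Relative Policy Optimization is built on: several outputs are sampled for the same request, and each output's reward is normalized against the mean and standard deviation of its own group. The function name, the example rewards, the `eps` term, and the use of the population standard deviation are all assumptions made for illustration and are not taken from JarvisArt.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled output's reward against its own group.

    Generic GRPO-style advantage: (r_i - mean(r)) / (std(r) + eps).
    Population std and eps are illustrative choices, not the paper's.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate retouching plans
# sampled for the same user request.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive a positive advantage and those below it a negative one, so no separately learned value baseline is needed.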
