Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling
February 9, 2026
Authors: Ruijie Ye, Jiayi Zhang, Zhuoxin Liu, Zihao Zhu, Siyuan Yang, Li Li, Tianfu Fu, Franck Dernoncourt, Yue Zhao, Jiacheng Zhu, Ryan Rossi, Wenhao Chai, Zhengzhong Tu
cs.AI
Abstract
We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can degrade object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows, which often operate on ultra-high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and it also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.
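The idea behind localized, layer-based editing can be illustrated with a minimal sketch. It is not the authors' implementation: the names `crop`, `edit_region`, and `changed_outside`, the rectangular box, and the toy `invert` edit are all assumptions made for illustration. The sketch shows only the general principle that editing a cropped layer and pasting it back at its original position leaves every pixel outside the target region untouched and keeps the image at its native size.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
# Hypothetical box convention: (top, left, bottom, right), bottom/right exclusive.
Box = Tuple[int, int, int, int]


def crop(image: Image, box: Box) -> Image:
    """Return the rectangular layer covered by box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def edit_region(image: Image, box: Box, edit_fn: Callable[[Image], Image]) -> Image:
    """Apply edit_fn to the cropped layer only, then paste it back in place."""
    top, left, bottom, right = box
    layer = edit_fn(crop(image, box))
    if len(layer) != bottom - top or any(len(r) != right - left for r in layer):
        raise ValueError("edited layer must keep the size of the cropped region")
    out = [list(row) for row in image]  # copy, so the input image is not mutated
    for dy, layer_row in enumerate(layer):
        out[top + dy][left:right] = layer_row
    return out


def changed_outside(before: Image, after: Image, box: Box) -> int:
    """Count pixels that differ outside box (0 means the background is preserved)."""
    top, left, bottom, right = box
    count = 0
    for y, (row_b, row_a) in enumerate(zip(before, after)):
        for x, (pb, pa) in enumerate(zip(row_b, row_a)):
            inside = top <= y < bottom and left <= x < right
            if not inside and pb != pa:
                count += 1
    return count


def invert(layer: Image) -> Image:
    """Toy stand-in for an image editor: invert the colors of the layer."""
    return [[(255 - r, 255 - g, 255 - b) for (r, g, b) in row] for row in layer]


# A tiny 8x6 test image; a real workflow would use a 4K image instead.
image = [[(x, y, 0) for x in range(8)] for y in range(6)]
box = (1, 2, 4, 5)
edited = edit_region(image, box, invert)
```

In this toy setting the background is preserved by construction, since only the pasted rectangle is written. A real system would additionally need soft masks or blending at the layer boundary, which this sketch leaves out.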