Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling
February 9, 2026
Authors: Ruijie Ye, Jiayi Zhang, Zhuoxin Liu, Zihao Zhu, Siyuan Yang, Li Li, Tianfu Fu, Franck Dernoncourt, Yue Zhao, Jiacheng Zhu, Ryan Rossi, Wenhao Chai, Zhengzhong Tu
cs.AI
Abstract
We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, and repeated multi-turn edits degrade object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows, which often operate on ultra-high-definition images (e.g., 4K). To address these challenges, we propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and it also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.
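The idea of localized, layer-based editing that leaves non-target regions untouched can be illustrated with a minimal sketch. This is not the paper's implementation: the helper names (`edit_layer`, `changed_outside`, `brighten`), the rectangular box standing in for an object layer, and the toy grayscale image are all assumptions made for illustration. The sketch crops a region, edits only that crop, pastes it back at native resolution, and then counts how many pixels outside the region changed.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def edit_layer(image: Image, box: Box, edit_fn: Callable[[Image], Image]) -> Image:
    """Apply edit_fn only to the crop inside box and paste the result back."""
    top, left, bottom, right = box
    crop = [row[left:right] for row in image[top:bottom]]
    edited = edit_fn(crop)
    # The edited layer must keep its size so it can be pasted back in place.
    if len(edited) != bottom - top or any(len(r) != right - left for r in edited):
        raise ValueError("edit_fn must preserve the crop size")
    out = [row[:] for row in image]  # copy so the input image stays untouched
    for dy, row in enumerate(edited):
        out[top + dy][left:right] = row
    return out


def changed_outside(before: Image, after: Image, box: Box) -> int:
    """Count pixels outside box that differ between the two images."""
    top, left, bottom, right = box
    count = 0
    for y, (row_a, row_b) in enumerate(zip(before, after)):
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            inside = top <= y < bottom and left <= x < right
            if not inside and a != b:
                count += 1
    return count


def brighten(crop: Image) -> Image:
    """Hypothetical stand-in for an editing model acting on one layer."""
    return [[min(255, p + 50) for p in row] for row in crop]


image = [[10 * (x + y) for x in range(6)] for y in range(4)]
box = (1, 2, 3, 5)  # hypothetical target region
result = edit_layer(image, box, brighten)
```

Because the edit function only ever sees the crop, pixels outside the box are preserved by construction, and the output keeps the input's full resolution. A real system would presumably use soft masks and blending at layer boundaries rather than a hard rectangle; that is beyond this sketch.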