GEMS: Agent-Native Multimodal Generation with Memory and Skills
March 30, 2026
Authors: Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang
cs.AI
Abstract
Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet they continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundation models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to handle diverse downstream applications effectively. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of an agent harness in extending model capabilities beyond their original limits.
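To make the three components in the abstract more concrete, the following is a minimal Python sketch of how a closed loop, a trajectory-level memory, and on-demand skill loading might fit together. It is not the GEMS implementation: the abstract does not specify the agents, the memory schema, or the skill format, so every name here (`Memory`, `SkillRegistry`, `run_loop`, and the `toy_*` functions) is a hypothetical stand-in. The toy "generator" only honors the first four words of a prompt, which loosely mimics a backend that drops parts of a complex instruction.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Memory:
    """Trajectory-level memory: factual states plus compressed summaries."""
    states: List[Tuple[str, float]] = field(default_factory=list)  # (prompt, score)
    summaries: List[str] = field(default_factory=list)

    def record(self, prompt: str, score: float, feedback: str) -> None:
        self.states.append((prompt, score))
        self.summaries.append(f"round {len(self.states)}: {score:.2f}, {feedback}")

    def best(self) -> Tuple[str, float]:
        # max() returns the first maximal element, so earlier rounds win ties.
        return max(self.states, key=lambda state: state[1])


class SkillRegistry:
    """Registers skill loaders and materializes a skill only when requested."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], str]] = {}
        self._loaded: Dict[str, str] = {}

    def register(self, name: str, loader: Callable[[], str]) -> None:
        self._loaders[name] = loader

    def load(self, name: str) -> str:
        if name not in self._loaded:  # the loader runs at most once per skill
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]


def run_loop(request, generate, verify, refine, memory, guidance="", max_rounds=3):
    """Generate, verify, record, and refine until the verifier is satisfied."""
    prompt = request
    for _ in range(max_rounds):
        output = generate(prompt)
        score, feedback = verify(output)
        memory.record(prompt, score, feedback)
        if score >= 1.0:
            break
        prompt = refine(prompt, feedback, memory.summaries, guidance)
    return memory.best()


# Hypothetical toy stand-ins for a generator, a verifier, and a refiner.
REQUIRED = ["cube", "sphere"]


def toy_generate(prompt: str) -> str:
    # Pretend the backend only follows the first four words of the prompt.
    return " ".join(prompt.split()[:4])


def toy_verify(output: str) -> Tuple[float, str]:
    # Score is the fraction of required objects present; feedback lists the rest.
    missing = [word for word in REQUIRED if word not in output.split()]
    return 1.0 - len(missing) / len(REQUIRED), " ".join(missing)


def toy_refine(prompt: str, feedback: str, summaries: List[str], guidance: str) -> str:
    # Move the missing objects to the front; summaries and guidance are unused here.
    return f"{feedback} {prompt}".strip()
```

Under these assumptions, calling `run_loop("a red cube next to a blue sphere", toy_generate, toy_verify, toy_refine, Memory())` runs two rounds: the first output omits the sphere, the refiner moves it to the front, and the second round passes verification. A real system would replace the toy functions with model calls and could pass text from `SkillRegistry.load` as `guidance`.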