GEMS: Agent-Native Multimodal Generation with Memory and Skills
March 30, 2026
Authors: Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang
cs.AI
Abstract
Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of the agent harness in extending model capabilities beyond their original limits.
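To make the three components concrete, the following is a minimal sketch of how an Agent Loop with trajectory-level memory and on-demand skill loading might fit together. All names (`AgentMemory`, `load_skill`, the scoring threshold, and the skill strings) are illustrative assumptions, not the authors' implementation; the generator and critic are passed in as callables standing in for the multimodal backend and evaluator.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Trajectory-level memory: raw factual states plus compressed summaries.

    Hypothetical stand-in for GEMS's Agent Memory: full per-iteration
    states give a global view, while compressed (score, critique)
    summaries reduce redundancy for later iterations.
    """
    states: list = field(default_factory=list)     # factual states per iteration
    summaries: list = field(default_factory=list)  # compressed experiential notes

    def record(self, state: dict) -> None:
        self.states.append(state)
        # Compress: keep only the score and critique as the summary.
        self.summaries.append((state["score"], state["critique"]))

# Illustrative, extensible skill library (Agent Skill); entries are
# placeholder instruction strings, loaded only when the task needs them.
SKILLS = {
    "poster_layout": "balance text blocks against the focal image",
    "photo_realism": "match lighting direction across composited elements",
}

def load_skill(task: str) -> str:
    """On-demand skill loading keyed by downstream task."""
    return SKILLS.get(task, "")

def agent_loop(prompt: str, task: str, generate, critic,
               max_iters: int = 3, threshold: float = 0.9):
    """Closed-loop refinement: generate, critique, revise until good enough."""
    memory = AgentMemory()
    skill = load_skill(task)
    current = prompt + (f" [skill: {skill}]" if skill else "")
    image = None
    for i in range(max_iters):
        image = generate(current)
        score, critique = critic(image)
        memory.record({"iter": i, "prompt": current,
                       "score": score, "critique": critique})
        if score >= threshold:
            break  # closed loop terminates once the critic is satisfied
        current = f"{prompt} [revise: {critique}]"
    return image, memory
```

With stub `generate`/`critic` callables, the loop records one memory entry per iteration and stops early once the critique score clears the threshold, which is the closed-loop behavior the abstract describes.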