GEMS: Agent-Native Multimodal Generation with Memory and Skills

Abstract

Recent multimodal generation models have made remarkable progress on general-purpose generation tasks, yet they still struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundation models on both general and downstream tasks. GEMS is built on three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, giving a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to handle diverse downstream applications effectively. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating that an agent harness can extend model capabilities beyond their original limits.
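
The abstract names the three components but does not specify their interfaces, so the following is only a minimal sketch of how a closed loop, a two-level trajectory memory, and on-demand skills could fit together. Every name here (`Step`, `TrajectoryMemory`, `SkillRegistry`, `agent_loop`, `demo`) is hypothetical and is not taken from GEMS; the generator, verifier, and refiner are toy stand-ins for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """Factual state of one iteration: what was tried and how it scored."""
    prompt: str
    score: float
    feedback: str


@dataclass
class TrajectoryMemory:
    """Two-level memory (assumed layout): raw steps plus short summaries."""
    steps: List[Step] = field(default_factory=list)
    summaries: List[str] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)
        # Stand-in for a model-written experiential summary.
        self.summaries.append(f"score={step.score:.2f}: {step.feedback}")

    def best(self) -> Optional[Step]:
        return max(self.steps, key=lambda s: s.score, default=None)


class SkillRegistry:
    """Skills are registered as loaders and materialized only on request."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], str]] = {}
        self.loaded: Dict[str, str] = {}

    def register(self, name: str, loader: Callable[[], str]) -> None:
        self._loaders[name] = loader

    def load(self, name: str) -> str:
        if name not in self.loaded:
            self.loaded[name] = self._loaders[name]()
        return self.loaded[name]


def agent_loop(task, generate, verify, refine, memory, max_iters=4, target=0.9):
    """Generate, verify, record, refine; stop once the score reaches target."""
    prompt = task
    for _ in range(max_iters):
        output = generate(prompt)
        score, feedback = verify(task, output)
        memory.record(Step(prompt, score, feedback))
        if score >= target:
            break
        # The refiner sees the compressed summaries, not the raw trajectory.
        prompt = refine(prompt, feedback, memory.summaries)
    return memory.best()


def demo():
    """Toy run: the 'image' is just the prompt text (an assumption)."""
    skills = SkillRegistry()
    skills.register("poster", lambda: "use a bold title and high contrast")
    skills.register("comic", lambda: "use panels and speech bubbles")  # never loaded

    required = ["red cube", "blue sphere", "high contrast"]

    def generate(prompt: str) -> str:
        return prompt

    def verify(task: str, output: str) -> Tuple[float, str]:
        missing = [r for r in required if r not in output]
        score = 1 - len(missing) / len(required)
        return score, (f"missing: {missing[0]}" if missing else "ok")

    def refine(prompt: str, feedback: str, summaries: List[str]) -> str:
        return f"{prompt}, {feedback.split(': ', 1)[1]}"

    memory = TrajectoryMemory()
    task = "a red cube. " + skills.load("poster")
    best = agent_loop(task, generate, verify, refine, memory)
    return best, memory, skills


if __name__ == "__main__":
    best, memory, skills = demo()
    print(best)
    print(memory.summaries)
```

In this toy run the verifier reports one missing element, the refiner appends it, and the loop stops on the second iteration. Only the requested skill is ever loaded, which is the point of keeping loaders lazy.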
