CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing

Abstract

Text-to-image models such as Stable Diffusion and DALLE-3 still struggle with multi-turn image editing. We decompose such a task into an agentic workflow (path) of tool use that addresses a sequence of subtasks with AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. Large language models (LLMs) possess prior knowledge of subtask planning, but they may lack accurate estimates of the tools' capabilities and costs, which are needed to decide which tool to apply to each subtask. Can we combine the strengths of LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach, "CoSTA*", that uses LLMs to create a subtask tree, uses that tree to prune a graph of AI tools for the given task, and then runs A* search on the resulting small subgraph to find a tool path. To better balance total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask's output is then evaluated by a vision-language model (VLM); a failure triggers an update of the tool's cost and quality on that subtask, so the A* search can recover quickly from failures and explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing tasks, on which CoSTA* outperforms state-of-the-art image-editing models and agents in both cost and quality and supports versatile trade-offs according to user preference.
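The search-and-recover loop described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives no formulas, so the weighted blend in `edge_weight`, the penalty applied after a failed check, the `check` callback standing in for the VLM, and every tool name and number in `TOOLS` are assumptions made for illustration. The subtask tree is also simplified to a linear chain.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Hypothetical tool table: subtask -> {tool name: (cost, quality in [0, 1])}.
# All names and numbers are invented for this example.
Tools = Dict[str, Dict[str, Tuple[float, float]]]

SUBTASKS = ["detect_object", "remove_object"]
TOOLS: Tools = {
    "detect_object": {"fast_detector": (1.0, 0.9), "large_detector": (3.0, 0.95)},
    "remove_object": {"cheap_inpaint": (2.0, 0.6), "diffusion_inpaint": (8.0, 0.9)},
}


def edge_weight(cost: float, quality: float, alpha: float) -> float:
    """Assumed blend of cost and quality; lower is better."""
    return alpha * cost + (1.0 - alpha) * (1.0 - quality)


def a_star_tool_path(subtasks: List[str], tools: Tools, alpha: float) -> List[str]:
    """A* over a layered graph whose states are (subtask index, tools so far)."""
    # Admissible heuristic: cheapest possible edge for each remaining subtask.
    best = [min(edge_weight(c, q, alpha) for c, q in tools[s].values())
            for s in subtasks]
    remaining = [sum(best[i:]) for i in range(len(subtasks) + 1)]

    frontier = [(remaining[0], 0.0, 0, [])]  # entries are (f, g, index, path)
    while frontier:
        _, g, i, path = heapq.heappop(frontier)
        if i == len(subtasks):
            return path  # first goal popped is optimal for an admissible heuristic
        for name, (c, q) in tools[subtasks[i]].items():
            g2 = g + edge_weight(c, q, alpha)
            heapq.heappush(frontier, (g2 + remaining[i + 1], g2, i + 1, path + [name]))
    return []


def run_with_feedback(subtasks: List[str], tools: Tools, alpha: float,
                      check: Callable[[str, str], bool],
                      penalty: float = 10.0, max_failures: int = 20) -> List[str]:
    """Run subtasks in order; after a failed check, penalize the tool and replan."""
    tools = {s: dict(t) for s, t in tools.items()}  # do not mutate the caller's table
    done: List[str] = []
    failures = 0
    i = 0
    while i < len(subtasks):
        tool = a_star_tool_path(subtasks[i:], tools, alpha)[0]
        if check(subtasks[i], tool):  # stand-in for the VLM quality check
            done.append(tool)
            i += 1
            continue
        failures += 1
        if failures > max_failures:
            raise RuntimeError("too many failed checks")
        cost, _ = tools[subtasks[i]][tool]
        tools[subtasks[i]][tool] = (cost + penalty, 0.0)  # assumed update rule
    return done


if __name__ == "__main__":
    print(a_star_tool_path(SUBTASKS, TOOLS, alpha=0.5))
    # Pretend the checker rejects the cheap inpainting tool.
    print(run_with_feedback(SUBTASKS, TOOLS, 0.5, lambda s, t: t != "cheap_inpaint"))
```

Varying `alpha` between 0 and 1 moves the plan from quality-first to cost-first, which mirrors the user-controlled trade-off mentioned in the abstract. In this chain-shaped toy graph each tool choice is independent, so a greedy pick per subtask would give the same path; the search only matters once tools have dependencies on earlier outputs.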
