CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing
March 13, 2025
Authors: Advait Gupta, NandaKiran Velaga, Dang Nguyen, Tianyi Zhou
cs.AI
Abstract
Text-to-image models like Stable Diffusion and DALLE-3 still struggle with
multi-turn image editing. We decompose such a task into an agentic workflow
(path) of tool use that addresses a sequence of subtasks by AI tools of varying
costs. Conventional search algorithms require expensive exploration to find
tool paths. While large language models (LLMs) possess prior knowledge of
subtask planning, they may lack accurate estimations of capabilities and costs
of tools to determine which to apply in each subtask. Can we combine the
strengths of both LLMs and graph search to find cost-efficient tool paths? We
propose a three-stage approach "CoSTA*" that leverages LLMs to create a subtask
tree, which helps prune a graph of AI tools for the given task, and then
conducts A* search on the small subgraph to find a tool path. To better balance
the total cost and quality, CoSTA* combines both metrics of each tool on every
subtask to guide the A* search. Each subtask's output is then evaluated by a
vision-language model (VLM), where a failure will trigger an update of the
tool's cost and quality on the subtask. Hence, the A* search can recover from
failures quickly to explore other paths. Moreover, CoSTA* can automatically
switch between modalities across subtasks for a better cost-quality trade-off.
We build a novel benchmark of challenging multi-turn image editing, on which
CoSTA* outperforms state-of-the-art image-editing models or agents in terms of
both cost and quality, and performs versatile trade-offs according to user preference.
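
The search-and-recover loop described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tool names, the normalized (cost, quality) values, the linear weighting `alpha * cost + (1 - alpha) * (1 - quality)`, the hop-count heuristic, and the failure penalty. The `passes_check` callback stands in for the VLM evaluation of each subtask output.

```python
import heapq
from collections import deque


def weight(stats, node, alpha):
    """Assumed scalar price of running a tool: cost vs. quality shortfall."""
    cost, quality = stats[node]
    return alpha * cost + (1.0 - alpha) * (1.0 - quality)


def hop_heuristic(graph, stats, goal, alpha):
    """Lower bound: (tool calls still needed) * (cheapest tool weight)."""
    reverse = {}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    hops = {goal: 0}
    queue = deque([goal])
    while queue:  # backward BFS from the goal
        node = queue.popleft()
        for prev in reverse.get(node, []):
            if prev not in hops:
                hops[prev] = hops[node] + 1
                queue.append(prev)
    w_min = min(weight(stats, n, alpha) for n in stats if n != goal)
    # The final hop into the virtual goal node is free, hence "h - 1".
    return {n: max(h - 1, 0) * w_min for n, h in hops.items()}


def a_star(graph, stats, start, goal, alpha):
    """A* over the tool graph; returns (path, total weight)."""
    h = hop_heuristic(graph, stats, goal, alpha)
    frontier = [(h.get(start, 0.0), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry
        for nxt in graph.get(node, []):
            if nxt not in h:
                continue  # cannot reach the goal from here
            new_g = g + weight(stats, nxt, alpha)
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + h[nxt], new_g, nxt, path + [nxt]))
    return None, float("inf")


def run_with_feedback(graph, stats, start, goal, alpha, passes_check, max_retries=5):
    """Re-plan whenever the (stand-in) evaluator rejects a tool's output."""
    stats = dict(stats)  # do not mutate the caller's table
    for _ in range(max_retries):
        path, _ = a_star(graph, stats, start, goal, alpha)
        if path is None:
            return None
        failed = next((n for n in path[1:-1] if not passes_check(n)), None)
        if failed is None:
            return path
        cost, _ = stats[failed]
        stats[failed] = (cost + 1.0, 0.0)  # assumed penalty: pricier, zero quality
    return None


# Hypothetical two-subtask graph: detect an object, then inpaint it.
graph = {
    "start": ["detect_fast", "detect_accurate"],
    "detect_fast": ["inpaint_small", "inpaint_large"],
    "detect_accurate": ["inpaint_small", "inpaint_large"],
    "inpaint_small": ["goal"],
    "inpaint_large": ["goal"],
}
# Hypothetical (normalized cost, quality) per tool; "goal" is a free virtual sink.
stats = {
    "detect_fast": (0.1, 0.8),
    "detect_accurate": (0.5, 0.95),
    "inpaint_small": (0.2, 0.7),
    "inpaint_large": (0.9, 0.95),
    "goal": (0.0, 1.0),
}

balanced_path, balanced_weight = a_star(graph, stats, "start", "goal", alpha=0.5)
quality_path, _ = a_star(graph, stats, "start", "goal", alpha=0.0)
recovered_path = run_with_feedback(
    graph, stats, "start", "goal", 0.5, passes_check=lambda tool: tool != "inpaint_small"
)
```

In this sketch, `alpha` plays the role of the user preference: values near 1 favor cheap tools, values near 0 favor high-quality ones. When the stand-in evaluator rejects a tool, its entry is penalized and the search is rerun, so the next plan routes around it.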