CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing

Abstract

Text-to-image models such as Stable Diffusion and DALLE-3 still struggle with multi-turn image editing. We decompose such a task into an agentic workflow (path) of tool use that addresses a sequence of subtasks with AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. Large language models (LLMs), by contrast, possess prior knowledge of subtask planning, but they may lack accurate estimates of the capabilities and costs of tools, which are needed to decide which tool to apply to each subtask. Can we combine the strengths of LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach, "CoSTA*", that uses LLMs to create a subtask tree, uses that tree to prune a graph of AI tools for the given task, and then conducts A* search on the resulting small subgraph to find a tool path. To better balance total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask's output is then evaluated by a vision-language model (VLM), and a failure triggers an update of the tool's cost and quality on that subtask. Hence, the A* search can recover quickly from failures and explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models and agents in terms of both cost and quality, and supports flexible trade-offs according to user preference.
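The search stage described in the abstract can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the tool names, the linear weighting in `edge_weight`, the penalty values in `penalize`, and the zero default heuristic (with which A* reduces to uniform-cost search). The abstract does not specify how CoSTA* actually combines cost and quality, so this only shows the general idea of a weighted search that reroutes after a failed check.

```python
import heapq

def edge_weight(cost, quality, alpha):
    """Hypothetical scalarization: alpha in [0, 1] trades cost against quality loss."""
    return alpha * cost + (1.0 - alpha) * (1.0 - quality)

def a_star(graph, stats, start, goal, alpha=0.5, heuristic=lambda node: 0.0):
    """A* over a tool graph. With the default zero heuristic this is uniform-cost search."""
    frontier = [(heuristic(start), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route to this node was found
        for nxt in graph.get(node, []):
            cost, quality = stats[nxt]
            g_next = g + edge_weight(cost, quality, alpha)
            if g_next < best.get(nxt, float("inf")):
                best[nxt] = g_next
                heapq.heappush(
                    frontier, (g_next + heuristic(nxt), g_next, nxt, path + [nxt])
                )
    return None, float("inf")

def penalize(stats, tool, cost_penalty=1.0, quality_decay=0.5):
    """Assumed failure update: raise the tool's cost and lower its quality estimate."""
    cost, quality = stats[tool]
    stats[tool] = (cost + cost_penalty, quality * quality_decay)

# Toy subgraph for one subtask; tool names and numbers are invented.
graph = {
    "input": ["cheap_inpaint", "diffusion_inpaint"],
    "cheap_inpaint": ["output"],
    "diffusion_inpaint": ["output"],
    "output": [],
}
stats = {
    "cheap_inpaint": (0.1, 0.7),       # (cost, quality)
    "diffusion_inpaint": (0.6, 0.95),
    "output": (0.0, 1.0),
}

path_before, _ = a_star(graph, stats, "input", "output", alpha=0.5)
quality_first, _ = a_star(graph, stats, "input", "output", alpha=0.1)

# Pretend a VLM check rejected the cheap tool's output, then search again.
penalize(stats, "cheap_inpaint")
path_after, _ = a_star(graph, stats, "input", "output", alpha=0.5)
```

In this toy setup, a balanced `alpha` first picks the cheap tool, a quality-leaning `alpha` picks the diffusion tool, and the penalty applied after the simulated failure makes the balanced search switch to the diffusion tool as well.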
