FaSTA^*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

Abstract

We develop a cost-efficient neurosymbolic agent for challenging multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow." The agent combines fast, high-level subtask planning by large language models (LLMs) with slow but accurate tool use and a local A^* search per subtask to find a cost-efficient toolpath, i.e., a sequence of calls to AI tools. To reduce the cost of A^* on similar subtasks, we use LLMs to perform inductive reasoning over previously successful toolpaths, continuously extracting and refining frequently used subroutines and reusing them as new tools for future tasks. This yields an adaptive fast-slow planning scheme in which the higher-level subroutines are explored first and the low-level A^* search is activated only when they fail. The reusable symbolic subroutines considerably reduce exploration cost on subtasks of the same type applied to similar images, giving a human-like fast-slow toolpath agent, "FaSTA^*": LLMs first attempt fast subtask planning followed by rule-based subroutine selection per subtask, which is expected to cover most tasks, while the slow A^* search is triggered only for novel and challenging subtasks. In comparisons with recent image editing approaches, FaSTA^* is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in success rate.
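The fast-slow control flow described in the abstract can be illustrated with a minimal Python sketch. Everything in it is a hypothetical toy and not the authors' implementation: the "image state" is an integer, the two tools (`inc`, `double`) and their costs are invented, the default heuristic is zero, and storing a successful path in a dictionary is only a crude stand-in for the LLM-based subroutine mining the abstract describes.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy tools: (name, cost, effect on an integer "image state").
Tool = Tuple[str, float, Callable[[int], int]]
TOOLS: List[Tool] = [
    ("inc", 1.0, lambda s: s + 1),
    ("double", 1.5, lambda s: s * 2),
]
TOOL_BY_NAME: Dict[str, Tool] = {t[0]: t for t in TOOLS}


def run_path(start: int, path: List[str]) -> int:
    """Apply a sequence of tool names to a state and return the final state."""
    state = start
    for name in path:
        state = TOOL_BY_NAME[name][2](state)
    return state


def a_star(start: int, goal: int,
           h: Callable[[int], float] = lambda s: 0.0) -> Optional[List[str]]:
    """Slow path: A* over tool calls; h must be admissible (default is 0)."""
    # Queue entries are (cost so far + heuristic, cost so far, state, path).
    frontier: List[Tuple[float, float, int, List[str]]] = [(h(start), 0.0, start, [])]
    best_cost: Dict[int, float] = {start: 0.0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if cost > best_cost.get(state, float("inf")):
            continue  # stale queue entry
        for name, step_cost, effect in TOOLS:
            nxt = effect(state)
            if nxt > goal:
                continue  # toy pruning: assumes non-negative states, which never decrease
            new_cost = cost + step_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt, path + [name]))
    return None


def solve(subtask: str, start: int, goal: int,
          library: Dict[str, List[List[str]]]) -> Tuple[str, Optional[List[str]]]:
    """Fast path first (stored subroutines); slow A* only if all of them fail."""
    for path in library.get(subtask, []):
        if run_path(start, path) == goal:
            return "fast", path
    path = a_star(start, goal)
    if path is not None:
        # Crude stand-in for subroutine mining: remember the successful toolpath.
        library.setdefault(subtask, []).append(path)
    return "slow", path
```

With an empty `library`, a first call such as `solve("reach", 1, 6, library)` has to take the slow path; repeating the same call then succeeds on the fast path by replaying the stored toolpath, and a call with a different start state falls back to A^* only if no stored subroutine reaches the goal.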
