Visual Agentic Reinforcement Fine-Tuning

Abstract

A key trend in Large Reasoning Models (e.g., OpenAI's o3) is native agentic ability: using external tools such as web browsers to search, and writing and executing code for image manipulation, in order to think with images. In the open-source research community, significant progress has been made in language-only agentic abilities such as function calling and tool integration, but multimodal agentic capabilities that involve truly thinking with images, and the benchmarks needed to evaluate them, remain less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) in enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and to write code that manipulates and analyzes input images through cropping, rotation, and other image processing techniques. We also present the Multi-modal Agentic Tool Bench (MAT), with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves gains of +29.3% F1 / +25.9% EM on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.