Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling

Abstract

Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform on tasks involving long visual contexts and complex reasoning. As a result, their performance on document-based tasks is suboptimal. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents (planning, execution, judgment, and answer agents) with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, an approach that outperforms conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling, which balances agent-specific abilities and global collaboration, and agent-wise hybrid test-time scaling, which customizes the scaling strategy for each agent based on its function. Evaluated on benchmarks spanning both document-based and non-document-based settings, MACT shows superior performance at a smaller parameter scale without sacrificing ability on general and mathematical tasks. It stands out in particular on benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores, leading in 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.
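The abstract names the four agents and the judgment agent's verify-and-redirect role but does not spell out the control flow. The following is a minimal sketch of one way such a loop could be wired, not the authors' implementation. The function names, the `Verdict` type, the `"planning"`/`"execution"` redirect labels, and the `max_rounds` cap are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Output of the judgment step (hypothetical structure)."""
    ok: bool
    # Which earlier agent should revise: "planning" or "execution".
    redirect_to: Optional[str] = None


def run_pipeline(
    question: str,
    plan_fn: Callable[[str, Optional[str]], str],
    execute_fn: Callable[[str, str, Optional[str]], str],
    judge_fn: Callable[[str, str, str], Verdict],
    answer_fn: Callable[[str, str, str], str],
    max_rounds: int = 3,  # assumed cap on revision rounds
) -> str:
    """Plan, execute, judge, and answer, with judge-driven redirection."""
    plan = plan_fn(question, None)
    result = execute_fn(question, plan, None)

    for _ in range(max_rounds):
        verdict = judge_fn(question, plan, result)
        if verdict.ok:
            break
        # The judge only verifies; it hands revision back to an earlier agent.
        if verdict.redirect_to == "planning":
            plan = plan_fn(question, result)
            result = execute_fn(question, plan, None)
        else:
            result = execute_fn(question, plan, result)

    return answer_fn(question, plan, result)


# Toy stand-ins for the four agents (real agents would be VLM calls).
def toy_plan(question, feedback):
    return "plan-v2" if feedback else "plan-v1"


def toy_execute(question, plan, previous):
    return f"run({plan})"


def toy_judge(question, plan, result):
    if "v2" in result:
        return Verdict(ok=True)
    return Verdict(ok=False, redirect_to="planning")


def toy_answer(question, plan, result):
    return f"answer from {result}"


final = run_pipeline("What is the total?", toy_plan, toy_execute, toy_judge, toy_answer)
print(final)
```

In this toy run the judge rejects the first execution, sends control back to the planning stand-in, and accepts the second attempt. Keeping verification separate from revision mirrors the abstract's description of a judgment agent that "exclusively verifies correctness"; how the actual framework selects the redirect target is not stated in the abstract.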
