
Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling

August 5, 2025
作者: Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Shuicheng Yan
cs.AI

Abstract

Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform on tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revision, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling, which balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes a different scaling strategy for each agent based on its function. Evaluated on benchmarks spanning both document-based and non-document-based settings, MACT achieves superior performance at a smaller parameter scale without sacrificing performance on general and mathematical tasks. In particular, it stands out on benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average score, leading on 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.
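The abstract describes a plan–execute–judge–answer loop in which the judgment agent only verifies correctness and routes control back to an earlier agent for revision rather than editing outputs itself. The following is a minimal Python sketch of that control flow under stated assumptions: all class names, method signatures, and the placeholder verification logic are illustrative inventions, not the authors' released implementation (see the linked repository for the actual code).

```python
# Hypothetical sketch of the MACT plan -> execute -> judge -> answer loop.
# Agent interfaces and the revision protocol are assumptions for illustration.

from dataclasses import dataclass, field

@dataclass
class Context:
    question: str
    document: str                    # stands in for the long visual context
    plan: str = ""
    trace: list = field(default_factory=list)

class PlanningAgent:
    def run(self, ctx: Context) -> None:
        # Decompose the question into steps over the document.
        ctx.plan = f"steps for: {ctx.question}"

class ExecutionAgent:
    def run(self, ctx: Context) -> None:
        # Carry out the plan against the document content.
        ctx.trace.append(f"executed {ctx.plan!r} on {len(ctx.document)} chars")

class JudgmentAgent:
    def run(self, ctx: Context) -> str:
        # Verify correctness only; on failure, name the agent to redo.
        ok = bool(ctx.trace)         # placeholder check, not a real verifier
        return "answer" if ok else "planning"

class AnswerAgent:
    def run(self, ctx: Context) -> str:
        return f"answer derived from: {ctx.trace[-1]}"

def mact_pipeline(question: str, document: str, max_rounds: int = 3) -> str:
    """Run the loop; the judgment agent redirects to earlier agents
    for revision instead of correcting outputs itself."""
    ctx = Context(question, document)
    planner, executor, judge, answerer = (
        PlanningAgent(), ExecutionAgent(), JudgmentAgent(), AnswerAgent())
    for _ in range(max_rounds):
        planner.run(ctx)
        executor.run(ctx)
        if judge.run(ctx) == "answer":
            return answerer.run(ctx)
        # Otherwise loop back for another revision round.
    return answerer.run(ctx)         # fall back after max_rounds

if __name__ == "__main__":
    print(mact_pipeline("What is the invoice total?", "...document text..."))
```

In this reading, test-time scaling would plug in per agent (e.g., sampling several candidate plans or answers and letting the judgment agent select), matching the abstract's claim that each agent receives a scaling strategy customized to its function.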