Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling
August 5, 2025
Authors: Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Shuicheng Yan
cs.AI
Abstract
Existing vision-language models (VLMs), whether generalists or specialists,
remain constrained by their parameter scale, lack robust self-correction
capabilities, and underperform in tasks involving long visual contexts and
complex reasoning, resulting in suboptimal performance on document-based tasks.
To address this, we propose MACT, a Multi-Agent Collaboration framework with
Test-Time scaling, tailored for visual document understanding and visual
question answering (VQA). It comprises four distinct small-scale agents, i.e.,
planning, execution, judgment, and answer agents, with clearly defined roles
and effective collaboration. Notably, the judgment agent exclusively verifies
correctness and redirects to prior agents for revisions, outperforming
conventional correction strategies. To further expand the capability boundaries
of the framework, we propose mixed reward modeling that balances agent-specific
abilities and global collaboration, as well as agent-wise hybrid test-time
scaling, which customizes different scaling strategies for each agent based on
their functions. Evaluated on benchmarks spanning both document-based and
non-document-based settings, our MACT shows superior performance with a smaller
parameter scale, without sacrificing capability on general and mathematical
tasks. In particular, it stands out on benchmarks involving long visual contexts
and complex reasoning. The three variants of MACT consistently hold the top
three positions in average scores, leading in 13 of the 15 benchmarks. Code
will be available at: https://github.com/YU-deep/MACT.git.
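The abstract describes a four-agent pipeline in which a judgment agent verifies correctness and redirects to earlier agents for revision, under a bounded test-time budget. The sketch below illustrates that control flow only; the agent interfaces (`planner`, `executor`, `judge`, `answerer`), the verdict labels, and the round limit are hypothetical placeholders, not the authors' actual implementation.

```python
# Minimal sketch of the plan -> execute -> judge -> answer loop described
# in the abstract. All interfaces here are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MACTPipeline:
    planner: Callable[[str], str]             # question -> plan
    executor: Callable[[str, str], str]       # (question, plan) -> evidence
    judge: Callable[[str, str, str], str]     # -> "ok" | "replan" | "re-execute"
    answerer: Callable[[str, str], str]       # (question, evidence) -> answer
    max_rounds: int = 3                       # test-time correction budget

    def run(self, question: str) -> str:
        plan = self.planner(question)
        evidence = self.executor(question, plan)
        for _ in range(self.max_rounds):
            verdict = self.judge(question, plan, evidence)
            if verdict == "ok":
                break
            if verdict == "replan":
                # Judgment agent redirects all the way back to the planner.
                plan = self.planner(question)
            # In either non-"ok" case, re-run execution on the current plan.
            evidence = self.executor(question, plan)
        return self.answerer(question, evidence)
```

The key design point mirrored from the abstract is that the judgment agent never produces an answer itself; it only routes control back to a prior agent, keeping verification separate from generation.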