Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling

Abstract

Existing vision-language models (VLMs), whether generalist or specialist, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform on tasks involving long visual contexts and complex reasoning. As a result, their performance on document-based tasks is suboptimal. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, namely planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, which outperforms conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling, which balances agent-specific abilities and global collaboration, and agent-wise hybrid test-time scaling, which customizes a different scaling strategy for each agent based on its function. Evaluated on benchmarks spanning both document-based and non-document-based settings, MACT shows superior performance at a smaller parameter scale without sacrificing ability on general and mathematical tasks. It stands out in particular on benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores and lead in 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.
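
The control flow among the four agents can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify interfaces, so the function name `run_collaboration`, the `Verdict` type, the redirect labels `"planning"` and `"execution"`, and the `max_rounds` cap are all assumptions made for illustration. Each agent is passed in as a plain callable standing in for a small VLM, and test-time scaling and reward modeling are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Output of the judgment agent (hypothetical structure)."""
    correct: bool
    # Which prior agent should revise; the labels are assumed, not from the paper.
    redirect_to: Optional[str] = None  # "planning" or "execution"


def run_collaboration(
    question: str,
    plan: Callable[[str], str],
    execute: Callable[[str, str], str],
    judge: Callable[[str, str, str], Verdict],
    answer: Callable[[str, str], str],
    max_rounds: int = 3,  # assumed cap on revision rounds
) -> str:
    """Plan, execute, judge, and answer, with judge-driven redirection."""
    current_plan = plan(question)
    result = execute(question, current_plan)

    for _ in range(max_rounds):
        verdict = judge(question, current_plan, result)
        if verdict.correct:
            break
        # The judge only verifies and redirects; it never edits the output itself.
        if verdict.redirect_to == "planning":
            current_plan = plan(question)
        # Redirecting to planning or to execution both end with a fresh execution.
        result = execute(question, current_plan)

    return answer(question, result)
```

Keeping verification separate from revision mirrors the abstract's statement that the judgment agent "exclusively verifies correctness and redirects to prior agents"; in this sketch the judge returns only a verdict and a redirect target, and the planning or execution agent produces the revised output.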
