Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow

Abstract

Multi-agent systems (MAS) powered by visual language models (VLMs) enable challenging tasks but suffer from a novel failure mode, multi-agent visual hallucination snowballing: a hallucination is seeded in a single agent and then amplified by subsequent agents, because they rely too heavily on the textual flow to relay visual information. Through turn-wise, layer-wise, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing and relate it to a reduction in the attention allocated to visual input. This analysis leads us to identify a subset of vision tokens with a unimodal attention peak in the middle layers. These tokens best preserve visual evidence, but they gradually diminish in deeper agent turns, which results in visual hallucination snowballing in MAS. We therefore propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages through a Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. Experimental results demonstrate that our method markedly reduces hallucination snowballing and consistently improves performance across eight benchmarks, four common MAS structures, and ten base models. The source code will be available at: https://github.com/YU-deep/ViF.git.
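
To make the two ideas in the abstract concrete (selecting vision tokens whose attention peaks once in the middle layers, then reallocating attention toward them), the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify the selection criterion, thresholds, or reallocation rule, so everything here is an assumption: the `attn[t][l]` layout, the middle band of 25% to 75% of depth, the multiplicative `boost`, and all function names are hypothetical.

```python
from typing import List, Sequence


def is_unimodal_mid_peak(profile: Sequence[float],
                         lo: float = 0.25, hi: float = 0.75) -> bool:
    """Return True if the per-layer profile rises to a single peak and then
    falls, with the peak inside the middle band [lo, hi] of relative depth.
    The band limits are assumed values, not taken from the paper."""
    n = len(profile)
    if n < 3:
        return False
    peak = max(range(n), key=lambda i: profile[i])  # first index of the maximum
    rising = all(profile[i] <= profile[i + 1] for i in range(peak))
    falling = all(profile[i] >= profile[i + 1] for i in range(peak, n - 1))
    in_middle = lo <= peak / (n - 1) <= hi
    return rising and falling and in_middle


def select_relay_tokens(attn: Sequence[Sequence[float]]) -> List[int]:
    """attn[t][l] is the attention mass on vision token t at layer l
    (hypothetical layout). Returns indices of candidate relay tokens."""
    return [t for t, profile in enumerate(attn) if is_unimodal_mid_peak(profile)]


def reallocate(weights: Sequence[float], relay: Sequence[int],
               boost: float = 2.0) -> List[float]:
    """Scale the weights of relay tokens by `boost` (assumed rule), then
    renormalize so the weights sum to 1 again."""
    relay_set = set(relay)
    scaled = [w * boost if i in relay_set else w for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]


if __name__ == "__main__":
    attn = [
        [0.1, 0.3, 0.6, 0.3, 0.1],  # single peak in the middle layer
        [0.6, 0.4, 0.3, 0.2, 0.1],  # peak at the first layer
        [0.1, 0.5, 0.2, 0.5, 0.1],  # two peaks
    ]
    relay = select_relay_tokens(attn)
    print(relay)
    print(reallocate([0.25, 0.25, 0.5], relay))
```

In a real VLM the per-layer profiles would come from the model's attention maps, and the reallocation would be applied inside the attention computation rather than to a standalone weight vector; the sketch only illustrates the shape of the selection and renormalization steps.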
