Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
September 26, 2025
Authors: Xinlei Yu, Chengming Xu, Guibin Zhang, Yongbo He, Zhangquan Chen, Zhucun Xue, Jiangning Zhang, Yue Liao, Xiaobin Hu, Yu-Gang Jiang, Shuicheng Yan
cs.AI
Abstract
Multi-Agent Systems (MAS) powered by Visual Language Models (VLMs) can tackle
challenging tasks but suffer from a novel failure mode, multi-agent visual
hallucination snowballing, in which hallucinations are seeded in a single agent
and amplified by subsequent ones because of over-reliance on textual flow to
relay visual information. Through turn-, layer-, and token-wise attention
analyses, we trace the essence of hallucination snowballing to a reduction in
visual attention allocation. This analysis leads us to identify a subset of
vision tokens with a unimodal attention peak in the middle layers; these tokens
best preserve visual evidence but gradually diminish in deeper agent turns,
resulting in visual hallucination snowballing in MAS. We therefore
propose ViF, a lightweight, plug-and-play mitigation paradigm that relays
inter-agent messages with Visual Flow powered by the selected visual relay
tokens and applies attention reallocation to amplify this pattern.
Experimental results demonstrate that our method markedly reduces hallucination
snowballing, consistently improving performance across eight benchmarks
based on four common MAS structures and ten base models. The source code will
be available at: https://github.com/YU-deep/ViF.git.
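
To make the two ideas named in the abstract more concrete, here is a minimal sketch of (1) picking vision tokens whose attention across layers has a single peak in the middle layers and (2) boosting attention on those tokens. It is not the authors' implementation: the function names, the `attn[layer][token]` layout, the middle-layer bounds, and the multiplicative `boost` with renormalization are all assumptions made for illustration.

```python
from typing import List, Sequence


def is_unimodal(profile: Sequence[float], tol: float = 1e-9) -> bool:
    """True if values rise (weakly) to one peak and then fall (weakly)."""
    peak = max(range(len(profile)), key=lambda i: profile[i])
    rising = all(profile[i] <= profile[i + 1] + tol for i in range(peak))
    falling = all(
        profile[i] + tol >= profile[i + 1]
        for i in range(peak, len(profile) - 1)
    )
    return rising and falling


def select_relay_tokens(
    attn: Sequence[Sequence[float]], mid_lo: int, mid_hi: int
) -> List[int]:
    """Return token indices whose per-layer attention profile is unimodal
    with its peak inside the (assumed) middle-layer range [mid_lo, mid_hi].

    attn[layer][token] is a hypothetical layout: attention mass received
    by each vision token at each layer.
    """
    selected = []
    for t in range(len(attn[0])):
        profile = [layer[t] for layer in attn]
        peak = max(range(len(profile)), key=lambda i: profile[i])
        if mid_lo <= peak <= mid_hi and is_unimodal(profile):
            selected.append(t)
    return selected


def reallocate(
    weights: Sequence[float], relay: Sequence[int], boost: float
) -> List[float]:
    """Scale attention on relay tokens by `boost`, then renormalize to sum 1.

    This multiplicative rule is an illustrative stand-in, not the paper's
    reallocation scheme.
    """
    relay_set = set(relay)
    scaled = [w * boost if i in relay_set else w for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Toy data: 5 layers, 3 vision tokens.
attn = [
    [0.10, 0.30, 0.05],
    [0.20, 0.25, 0.10],
    [0.40, 0.20, 0.02],
    [0.25, 0.15, 0.12],
    [0.10, 0.10, 0.03],
]
relay = select_relay_tokens(attn, mid_lo=1, mid_hi=3)
new_weights = reallocate([0.2, 0.5, 0.3], relay, boost=2.0)
```

In the toy data, only the first token rises to a single peak in a middle layer; the second decays monotonically from the first layer, and the third has two local peaks, so neither qualifies under this sketch's criterion.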