Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow

Abstract

Multi-agent systems (MAS) powered by visual language models (VLMs) can tackle challenging tasks but suffer from a novel failure mode, multi-agent visual hallucination snowballing, in which hallucinations seeded in a single agent are amplified by subsequent agents because of over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the nature of hallucination snowballing in terms of reduced visual attention allocation. These analyses lead us to identify a subset of vision tokens with a unimodal attention peak in the middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in visual hallucination snowballing in MAS. We therefore propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. Experimental results demonstrate that our method markedly reduces hallucination snowballing and consistently improves performance across eight benchmarks, based on four common MAS structures and ten base models. The source code will be available at https://github.com/YU-deep/ViF.git.
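To make the idea of a "unimodal attention peak in middle layers" concrete, the following is a minimal sketch of how such vision tokens might be flagged from per-layer attention profiles. It is an illustration under stated assumptions, not the ViF selection procedure: the abstract does not specify the criterion, so the function names, the input layout (`attn[t][l]`), and the middle-band thresholds `mid_lo` and `mid_hi` are all hypothetical.

```python
from typing import List, Sequence


def is_unimodal_mid_peak(
    profile: Sequence[float],
    mid_lo: float = 0.25,  # assumed lower bound of the "middle" band (fraction of depth)
    mid_hi: float = 0.75,  # assumed upper bound of the "middle" band (fraction of depth)
) -> bool:
    """Return True if the per-layer profile rises to one peak in the middle band, then falls."""
    n = len(profile)
    if n < 3:
        return False
    # Index of the (first) maximum across layers.
    peak = max(range(n), key=lambda i: profile[i])
    # The peak must lie inside the assumed middle band of the layer stack.
    if not (mid_lo * (n - 1) <= peak <= mid_hi * (n - 1)):
        return False
    # Unimodal: non-decreasing up to the peak, non-increasing after it.
    rising = all(profile[i] <= profile[i + 1] for i in range(peak))
    falling = all(profile[i] >= profile[i + 1] for i in range(peak, n - 1))
    return rising and falling


def select_relay_tokens(attn: Sequence[Sequence[float]]) -> List[int]:
    """attn[t][l] is the attention received by vision token t at layer l (hypothetical input)."""
    return [t for t, profile in enumerate(attn) if is_unimodal_mid_peak(profile)]


# Toy profiles for three vision tokens over five layers (made-up numbers).
toy_attn = [
    [0.1, 0.3, 0.6, 0.3, 0.1],  # single peak in the middle layer
    [0.6, 0.4, 0.3, 0.2, 0.1],  # peak at the first layer
    [0.1, 0.5, 0.2, 0.5, 0.1],  # two peaks, not unimodal
]
relay_indices = select_relay_tokens(toy_attn)
print(relay_indices)
```

On the toy data only the first token qualifies. A real analysis would extract the profiles from a VLM's attention maps at each agent turn, which this sketch does not attempt.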
