Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models

Abstract

The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs) by addressing modality bias without requiring additional training. In current AV-LLMs, audio and video features are typically processed jointly in the decoder. While this strategy facilitates unified multimodal understanding, it may introduce modality bias, where the model over-relies on one modality due to imbalanced training signals. To mitigate this, we propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications. FMD first performs modality-specific reasoning by processing audio-only and video-only inputs through the early decoder layers (the fork phase), and then merges the resulting hidden states for joint reasoning in the remaining layers (the merge phase). This approach promotes balanced modality contributions and leverages complementary information across modalities. We evaluate our method on two representative AV-LLMs, VideoLLaMA2 and video-SALMONN, using three benchmark datasets. Experimental results show consistent performance improvements on tasks focused on audio, video, and combined audio-visual reasoning, demonstrating the effectiveness of inference-time interventions for robust multimodal understanding.
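The fork and merge phases described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how a modality is removed from the input or how the two sets of hidden states are combined, so the sketch assumes zero-masking of the other modality's tokens and a simple position-wise merge (audio positions from the audio branch, video positions from the video branch, remaining positions averaged). The names `make_toy_layer`, `mask_tokens`, `fork_merge_decode`, and `fork_depth` are hypothetical, and the "layers" are plain functions standing in for real decoder layers.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Hidden = List[Vector]
Layer = Callable[[Hidden], Hidden]


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a decoder layer.

    Each token is shifted by a scaled copy of the sequence mean, so every
    token's output depends on all other tokens (a crude proxy for attention).
    """
    def layer(hidden: Hidden) -> Hidden:
        dim = len(hidden[0])
        mean = [sum(tok[d] for tok in hidden) / len(hidden) for d in range(dim)]
        return [[tok[d] + scale * mean[d] for d in range(dim)] for tok in hidden]
    return layer


def run_layers(layers: Sequence[Layer], hidden: Hidden) -> Hidden:
    """Apply the given layers in order."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def mask_tokens(hidden: Hidden, keep: Sequence[bool]) -> Hidden:
    """ASSUMPTION: a modality is removed by zeroing its token vectors."""
    return [tok if k else [0.0] * len(tok) for tok, k in zip(hidden, keep)]


def fork_merge_decode(
    layers: Sequence[Layer],
    tokens: Hidden,
    is_audio: Sequence[bool],
    is_video: Sequence[bool],
    fork_depth: int,
) -> Hidden:
    """Run the first `fork_depth` layers per modality, then merge and continue."""
    early, late = layers[:fork_depth], layers[fork_depth:]

    # Fork phase: one pass without video tokens, one pass without audio tokens.
    h_audio = run_layers(early, mask_tokens(tokens, [not v for v in is_video]))
    h_video = run_layers(early, mask_tokens(tokens, [not a for a in is_audio]))

    # Merge phase. ASSUMPTION: position-wise merge rule chosen for illustration.
    merged: Hidden = []
    for ha, hv, a, v in zip(h_audio, h_video, is_audio, is_video):
        if a:
            merged.append(ha)            # audio position: keep audio branch
        elif v:
            merged.append(hv)            # video position: keep video branch
        else:
            merged.append([(x + y) / 2 for x, y in zip(ha, hv)])  # e.g. text

    # Joint reasoning over the merged hidden states in the remaining layers.
    return run_layers(late, merged)


# Toy usage: one audio token, one video token, one text token.
layers = [make_toy_layer(0.5), make_toy_layer(0.25)]
tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
is_audio = [True, False, False]
is_video = [False, True, False]
output = fork_merge_decode(layers, tokens, is_audio, is_video, fork_depth=1)
```

Under these assumptions, setting `fork_depth=0` reduces the procedure to ordinary joint decoding, which makes the fork depth the single knob that controls how long the modalities are kept apart before joint reasoning begins.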
