Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models
May 27, 2025
Authors: Chaeyoung Jung, Youngjoon Jang, Jongmin Choi, Joon Son Chung
cs.AI
Abstract
The goal of this work is to enhance balanced multimodal understanding in
audio-visual large language models (AV-LLMs) by addressing modality bias
without requiring additional training. In current AV-LLMs, audio and video
features are typically processed jointly in the decoder. While this strategy
facilitates unified multimodal understanding, it may introduce modality bias,
where the model tends to over-rely on one modality due to imbalanced training
signals. To mitigate this, we propose Fork-Merge Decoding (FMD), a simple yet
effective inference-time strategy that requires no additional training or
architectural modifications. FMD first performs modality-specific reasoning by
processing audio-only and video-only inputs through the early decoder layers (a
fork phase), and then merges the resulting hidden states for joint reasoning in
the remaining layers (a merge phase). This approach promotes balanced modality
contributions and leverages complementary information across modalities. We
evaluate our method on two representative AV-LLMs, VideoLLaMA2 and
video-SALMONN, using three benchmark datasets. Experimental results show
consistent performance improvements on tasks focused on audio, video, and
combined audio-visual reasoning, demonstrating the effectiveness of
inference-time interventions for robust multimodal understanding.
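
The fork and merge phases described above can be illustrated with a minimal toy sketch. It is not the authors' implementation: the abstract does not specify how the other modality is removed in each branch or how hidden states are merged, so zeroing out the other modality's tokens and averaging the two branches are assumptions made here for illustration. The names `make_layer`, `fork_merge_decode`, and `fork_depth` are hypothetical, and the "decoder layers" are simple NumPy stand-ins rather than real transformer layers.

```python
import numpy as np


def make_layer(dim, seed):
    """Build a toy stand-in for one decoder layer (hypothetical, not a real LLM layer)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=(dim, dim))

    def layer(h):
        # Mix every token with the sequence mean so that tokens interact,
        # loosely mimicking how attention lets modalities influence each other.
        context = h.mean(axis=0, keepdims=True)
        return h + np.tanh((h + context) @ w)

    return layer


def fork_merge_decode(layers, audio, video, text, fork_depth):
    """Run the first `fork_depth` layers per modality, then the rest jointly."""
    # Fork phase: two branches, each seeing only one modality.
    # ASSUMPTION: the missing modality is represented by zeroed tokens.
    audio_only = np.concatenate([audio, np.zeros_like(video), text])
    video_only = np.concatenate([np.zeros_like(audio), video, text])
    for layer in layers[:fork_depth]:
        audio_only = layer(audio_only)
        video_only = layer(video_only)

    # Merge phase: combine the branch hidden states, then reason jointly.
    # ASSUMPTION: a plain average is used as the merge rule.
    h = 0.5 * (audio_only + video_only)
    for layer in layers[fork_depth:]:
        h = layer(h)
    return h


# Toy usage: 3 audio tokens, 4 video tokens, 2 text tokens, hidden size 8.
dim = 8
rng = np.random.default_rng(0)
audio = rng.normal(size=(3, dim))
video = rng.normal(size=(4, dim))
text = rng.normal(size=(2, dim))
layers = [make_layer(dim, seed) for seed in range(4)]

out = fork_merge_decode(layers, audio, video, text, fork_depth=2)
print(out.shape)
```

In this sketch, `fork_depth` controls how many early layers run separately per modality. Setting it to 0 reduces the procedure to a single joint pass over the averaged input, while larger values delay cross-modal interaction to later layers.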