Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models

Abstract

The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs) by addressing modality bias without additional training. In current AV-LLMs, audio and video features are typically processed jointly in the decoder. While this strategy facilitates unified multimodal understanding, it may introduce modality bias, where the model over-relies on one modality because of imbalanced training signals. To mitigate this, we propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications. FMD first performs modality-specific reasoning by processing audio-only and video-only inputs through the early decoder layers (the fork phase), and then merges the resulting hidden states for joint reasoning in the remaining layers (the merge phase). This approach promotes balanced modality contributions and leverages complementary information across modalities. We evaluate our method on two representative AV-LLMs, VideoLLaMA2 and video-SALMONN, using three benchmark datasets. Experimental results show consistent performance improvements on tasks focused on audio, video, and combined audio-visual reasoning, demonstrating the effectiveness of inference-time interventions for robust multimodal understanding.
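The fork and merge phases described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the other modality is suppressed in each branch or how the hidden states are merged, so the sketch assumes zero-masking of the other modality's tokens and element-wise averaging of the two branches. All names (`fork_merge_decode`, `mask_tokens`, `mean_mixing_layer`) are hypothetical, and the "decoder layers" are plain Python callables standing in for real transformer layers.

```python
from typing import Callable, List, Sequence, Set

Vector = List[float]
Hidden = List[Vector]                 # one hidden vector per token
Layer = Callable[[Hidden], Hidden]    # stand-in for a decoder layer


def mask_tokens(hidden: Hidden, keep: Set[int]) -> Hidden:
    """Zero every token whose position is not in `keep` (assumed masking scheme)."""
    return [vec if i in keep else [0.0] * len(vec) for i, vec in enumerate(hidden)]


def run_layers(hidden: Hidden, layers: Sequence[Layer]) -> Hidden:
    """Apply the given layers in order."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def fork_merge_decode(hidden: Hidden, layers: Sequence[Layer], fork_depth: int,
                      audio_pos: Set[int], video_pos: Set[int]) -> Hidden:
    """Run the first `fork_depth` layers per modality, then the rest jointly."""
    text_pos = set(range(len(hidden))) - audio_pos - video_pos
    early, late = layers[:fork_depth], layers[fork_depth:]

    # Fork phase: each branch sees the text tokens plus one modality only.
    audio_branch = run_layers(mask_tokens(hidden, audio_pos | text_pos), early)
    video_branch = run_layers(mask_tokens(hidden, video_pos | text_pos), early)

    # Merge phase: element-wise average of the two branches (an assumption),
    # followed by joint processing in the remaining layers.
    merged = [[(a + v) / 2.0 for a, v in zip(a_vec, v_vec)]
              for a_vec, v_vec in zip(audio_branch, video_branch)]
    return run_layers(merged, late)


def mean_mixing_layer(hidden: Hidden) -> Hidden:
    """Toy layer: add the mean token to every token (crude stand-in for attention)."""
    n, dim = len(hidden), len(hidden[0])
    mean = [sum(vec[d] for vec in hidden) / n for d in range(dim)]
    return [[x + m for x, m in zip(vec, mean)] for vec in hidden]


# Token 0 is audio, token 1 is video, token 2 is text.
tokens = [[1.0], [2.0], [3.0]]
output = fork_merge_decode(tokens, [mean_mixing_layer, mean_mixing_layer],
                           fork_depth=1, audio_pos={0}, video_pos={1})
```

In this sketch `fork_depth` controls how many early layers run per modality before the joint layers take over; in a real AV-LLM the same idea would operate on the decoder's hidden states at inference time, with no change to the weights.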
