Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

Abstract

Vision-Language Models (VLMs) can process visual and textual information in multiple formats: text, images, interleaved text and images, and even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various input representations. From these experiments, we suggest cost-effective strategies for generating summaries of text-heavy multimodal documents under different input-length budgets. We show that slides extracted from the video stream can be used beneficially as input in place of the raw video, and that a structured representation interleaving slides and transcript provides the best performance. Finally, we reflect on the nature of cross-modal interactions in multimodal presentations and share suggestions for improving the ability of VLMs to understand documents of this nature.
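As an illustration of what an interleaved slide-and-transcript representation could look like, the following minimal Python sketch places each slide before the transcript segments spoken while it is on screen. The abstract does not specify how slides are extracted or aligned, so the `Slide` and `Segment` types, the timestamp-based alignment, and the output item format are all assumptions made for this example, not the procedure used in the study.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Slide:
    start: float      # assumed: time in seconds when the slide first appears
    image_path: str   # assumed: path to the extracted slide image


@dataclass
class Segment:
    start: float      # assumed: time in seconds when the utterance begins
    text: str


def interleave(slides: List[Slide], segments: List[Segment]) -> List[Dict[str, str]]:
    """Build a slide/transcript sequence ordered by time.

    Each slide is followed by the transcript text that starts before the
    next slide appears. Speech preceding the first slide is emitted first.
    """
    slides = sorted(slides, key=lambda s: s.start)
    segments = sorted(segments, key=lambda s: s.start)
    items: List[Dict[str, str]] = []
    i = 0  # index of the next unconsumed transcript segment

    def take_until(end: float) -> None:
        """Append all remaining segments starting before `end` as one text item."""
        nonlocal i
        chunk = []
        while i < len(segments) and segments[i].start < end:
            chunk.append(segments[i].text)
            i += 1
        if chunk:
            items.append({"type": "text", "text": " ".join(chunk)})

    # Transcript spoken before any slide is shown.
    take_until(slides[0].start if slides else float("inf"))

    for k, slide in enumerate(slides):
        next_start = slides[k + 1].start if k + 1 < len(slides) else float("inf")
        items.append({"type": "image", "path": slide.image_path})
        take_until(next_start)

    return items


if __name__ == "__main__":
    demo = interleave(
        [Slide(0.0, "slide_0.png"), Slide(10.0, "slide_1.png")],
        [Segment(1.0, "Welcome."), Segment(12.0, "Next topic.")],
    )
    for item in demo:
        print(item)
```

The resulting list of image and text items would then be converted into whatever interleaved input format a given VLM expects; that conversion is model-specific and is left out here.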
