M-LongDoc: A Benchmark for Multimodal Super-Long Document Understanding and a Retrieval-Aware Tuning Framework

Abstract

The ability to understand and answer questions over documents is useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal content such as text, figures, and tables, which is very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended solutions and not just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning open-source models, we construct a training corpus in a fully automatic manner for the question-answering task over such documents. Experiments show that our tuning approach achieves a relative improvement of 4.6% in the correctness of model responses, compared to the baseline open-source models. Our data, code, and models are available at https://multimodal-documents.github.io.
