AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Abstract

Despite progress in video understanding, current multimodal large language models (MLLMs) struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, a lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated, clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.
