
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

June 5, 2025
Authors: Lidong Lu, Guo Chen, Zhiqi Li, Yicheng Liu, Tong Lu
cs.AI

Abstract

Despite progress in video understanding, current MLLMs still struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, a lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.
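As a rough illustration of the GRPO-style training signal mentioned in the abstract, the sketch below shows the group-relative advantage computation commonly used in GRPO: several responses are sampled per question, each is scored with a task reward (here, a hypothetical binary reward for a correct count), and rewards are normalized within the group. This is a minimal sketch under those assumptions, not the authors' released implementation; the function name and the counting reward are illustrative.

```python
# Minimal, hypothetical sketch of a GRPO-style group-relative advantage.
# Assumption: each sampled answer to a counting question receives a scalar
# reward (e.g., 1.0 if the predicted count matches the ground truth, else 0.0).
from typing import List
import statistics


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize per-response rewards within one sampled group (zero mean, unit std)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: rewards for four sampled answers to one counting question.
# Correct answers get positive advantages, incorrect ones negative.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```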