
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

June 5, 2025
Authors: Lidong Lu, Guo Chen, Zhiqi Li, Yicheng Liu, Tong Lu
cs.AI

Abstract

Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.
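
The abstract attributes AV-Reasoner's gains to GRPO (Group Relative Policy Optimization) training on counting-related tasks, but does not spell out the reward or the group-normalization step. The sketch below is only illustrative and not the paper's implementation: it assumes a hypothetical `counting_reward` that gives partial credit by relative error, and shows how GRPO-style advantages are computed by normalizing each sampled answer's reward against its group's mean and standard deviation.

```python
import math

def counting_reward(predicted: int, target: int) -> float:
    """Hypothetical counting reward: 1.0 for an exact count,
    partial credit that decays with relative error otherwise."""
    if target == 0:
        return 1.0 if predicted == 0 else 0.0
    rel_err = abs(predicted - target) / target
    return max(0.0, 1.0 - rel_err)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sampled response's reward
    by the mean and std of its group (G samples for the same prompt)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # avoid division by zero for constant groups
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers for a clip whose ground-truth count is 7.
rewards = [counting_reward(p, 7) for p in [7, 6, 10, 3]]
print(grpo_advantages(rewards))  # exact answer gets the largest advantage
```

This is a minimal sketch of the group-relative idea only; the actual reward shaping, curriculum schedule, and policy-update objective used to train AV-Reasoner are described in the paper, not here.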