AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Abstract

Despite progress in video understanding, current multimodal large language models (MLLMs) struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, a lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated, clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve a model's counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.
