VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Abstract

Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods such as multiple-choice questions in benchmarks like VideoMME and LongVideoBench, which tend to lack the depth needed to capture the complex demands of real-world users. To address this limitation, and because human annotation for video tasks is prohibitively costly and slow, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework and designed to automatically assess LMMs' video analysis abilities. VideoAutoArena uses user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework that incorporates a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' from a carefully curated subset of human annotations and show that our arena aligns strongly with human judgment while remaining scalable. Additionally, we introduce a fault-driven evolution strategy that progressively increases question complexity, pushing models toward more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs and provides insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, in which human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
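The abstract does not specify how the ELO Rating System is modified, so the following is only a minimal sketch of the standard Elo update applied to pairwise battles, not the authors' method. The function names, the initial rating of 1000, the K-factor of 32, and the model names `model_x` and `model_y` are all illustrative assumptions.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one battle.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    The K-factor (assumed 32 here) controls how far one result moves a rating.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def run_arena(
    battles: Iterable[Tuple[str, str, float]],
    initial: float = 1000.0,
    k: float = 32.0,
) -> Dict[str, float]:
    """Process (model_a, model_b, score_a) battles in order and return ratings."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical judged battles: model_x wins once, then the two tie.
    print(run_arena([("model_x", "model_y", 1.0), ("model_x", "model_y", 0.5)]))
```

In an arena like the one described, each `score_a` would come from the automated judge's verdict on a pair of responses. Because sequential Elo updates depend on battle order, arena-style leaderboards often average over shuffled orderings; whether VideoAutoArena does so is not stated here.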

