EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

Abstract

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To meet this quality bar, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous work, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the evaluation criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse goes beyond a static leaderboard and establishes foundational infrastructure for future work, such as reward models and evaluator agents.