TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

Abstract

Video generation poses many challenges beyond those of image generation. The temporal dimension introduces a wide range of possible variations across frames, any of which may violate consistency and continuity. In this study, we move beyond evaluating simple actions and argue that generated videos should, as real-world videos do, incorporate the emergence of new concepts and transitions in their relations as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of each scene, which reduces ambiguity in frame development and simplifies the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we extend TC-Bench's applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos; these metrics correlate significantly more strongly with human judgments than existing metrics do. Our comprehensive experimental results reveal that most video generators achieve less than 20% of the compositional changes, leaving substantial room for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and to synthesize various components across different time steps.
