TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

November 17, 2025
作者: Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen
cs.AI

Abstract

The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to those of large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models show untapped potential that remains constrained by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. VideoTPO uses an LLM to self-analyze generated candidates and identify their strengths and weaknesses, significantly enhancing reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
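The abstract describes VideoTPO only at a high level: sample candidate videos, let an LLM analyze them, and select a preferred output with no extra training, data, or reward model. Below is a minimal sketch of what such a preference-optimization-inspired test-time loop might look like. All names here (`generate`, `analyze`, `prefer`, `Candidate`) are hypothetical stubs for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a VideoTPO-style test-time loop.
# None of these names or signatures come from the paper; the I2V model,
# the LLM critique call, and the pairwise preference call are assumed stubs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    video: object          # a generated video (tensor, file path, ...)
    critique: str = ""     # LLM-written analysis of strengths/weaknesses

def video_tpo(
    prompt: str,
    image: object,
    generate: Callable[[str, object], object],  # I2V model: (prompt, image) -> video
    analyze: Callable[[str, object], str],      # LLM: (prompt, video) -> critique text
    prefer: Callable[[str, Candidate, Candidate], Candidate],  # LLM pairwise choice
    num_candidates: int = 4,
) -> Candidate:
    """Sample several candidates, have an LLM self-analyze each one,
    then keep the preferred candidate via pairwise comparison --
    no extra training, data, or reward model is involved."""
    candidates = [Candidate(generate(prompt, image)) for _ in range(num_candidates)]
    for c in candidates:
        c.critique = analyze(prompt, c.video)    # identify strengths/weaknesses
    best = candidates[0]
    for challenger in candidates[1:]:
        best = prefer(prompt, best, challenger)  # tournament-style selection
    return best
```

The tournament-style pairwise selection is one plausible way to aggregate LLM preferences; the paper may use a different selection or refinement scheme.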