TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

November 17, 2025
Authors: Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen
cs.AI

Abstract

The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
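The abstract describes VideoTPO only at a high level. Below is a minimal sketch of what such a test-time, preference-style selection loop could look like. All interfaces here (`video_model.generate`, `llm.analyze`, `llm.compare`) are hypothetical placeholders, and the selection rule is a simple pairwise comparison; the paper's actual prompts and aggregation details may differ.

```python
# Hypothetical sketch of a VideoTPO-style test-time loop: sample candidate
# videos, elicit LLM critiques (strengths/weaknesses), then keep the
# candidate the LLM prefers. No training, extra data, or reward model.

from dataclasses import dataclass


@dataclass
class Candidate:
    video: object         # generated video (e.g., frame tensor or file path)
    strengths: str = ""   # LLM-identified strengths (hypothetical critique)
    weaknesses: str = ""  # LLM-identified weaknesses (hypothetical critique)


def video_tpo(video_model, llm, image, prompt, n_candidates=2):
    """Return the preferred candidate video for an image-to-video prompt.

    `video_model` and `llm` are assumed interfaces, not any specific API.
    """
    # 1) Sample several candidate videos for the same image/prompt pair.
    candidates = [
        Candidate(video_model.generate(image, prompt))
        for _ in range(n_candidates)
    ]

    # 2) LLM self-analysis: elicit strengths and weaknesses per candidate.
    for c in candidates:
        c.strengths = llm.analyze(c.video, prompt, aspect="strengths")
        c.weaknesses = llm.analyze(c.video, prompt, aspect="weaknesses")

    # 3) Pairwise preference: ask the LLM which candidate better satisfies
    #    the reasoning task, conditioned on both critiques.
    best = candidates[0]
    for c in candidates[1:]:
        winner = llm.compare(
            prompt,
            (best.video, best.strengths, best.weaknesses),
            (c.video, c.strengths, c.weaknesses),
        )
        if winner == 1:  # second argument preferred
            best = c
    return best.video
```

Because selection happens entirely at inference time, this kind of loop trades extra sampling and LLM calls for better reasoning outcomes, which matches the abstract's claim that no additional training, data, or reward model is required.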