ChatPaper.ai


Seeing Fast and Slow: Learning the Flow of Time in Videos

April 23, 2026
作者: Yen-Siang Wu, Rundong Luo, Jingsen Zhu, Tao Tu, Ali Farhadi, Matthew Wallingford, Yu-Chiang Frank Wang, Steve Marschner, Wei-Chiu Ma
cs.AI

Abstract

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at a specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world models that understand how events unfold over time.
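The self-supervised setup described in the abstract can be illustrated with a toy pretext task: sample clips from a video at different frame strides and use the stride as the playback-speed label. The sketch below is illustrative only (the function `make_speed_clip`, the candidate speeds, and the clip length are assumptions, not details from the paper), showing how such labeled training examples could be constructed without any human annotation.

```python
import numpy as np

def make_speed_clip(video, speed, clip_len, rng):
    """Sample one training example for a playback-speed pretext task.

    Take every `speed`-th frame starting at a random offset, producing a
    clip of `clip_len` frames whose free label is its playback speed.
    (Hypothetical helper, not the paper's actual pipeline.)
    """
    num_frames = len(video)
    span = (clip_len - 1) * speed + 1  # frames covered by the strided clip
    start = rng.integers(0, num_frames - span + 1)
    clip = video[start : start + span : speed]
    return clip, speed

# Toy "video": frame t is just the scalar t, so motion is easy to inspect.
video = np.arange(300)
rng = np.random.default_rng(0)

# Build a small self-supervised batch over candidate speeds 1x, 2x, 4x, 8x.
batch = [make_speed_clip(video, s, clip_len=16, rng=rng) for s in (1, 2, 4, 8)]
for clip, label in batch:
    # Consecutive-frame differences reveal the stride; a speed estimator
    # would learn to predict `label` from motion statistics like these.
    assert np.all(np.diff(clip) == label)
```

In a real pipeline the scalar frames would be images and the classifier a video network, but the labeling principle is the same: the sampling stride itself supervises the model.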