Test-Time Training on Video Streams
July 11, 2023
Authors: Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang
cs.AI
Abstract
Prior work has established test-time training (TTT) as a general framework to
further improve a trained model at test time. Before making a prediction on
each test instance, the model is trained on the same instance using a
self-supervised task, such as image reconstruction with masked autoencoders. We
extend TTT to the streaming setting, where multiple test instances - video
frames in our case - arrive in temporal order. Our extension is online TTT: The
current model is initialized from the previous model, then trained on the
current frame and a small window of frames immediately before. Online TTT
significantly outperforms the fixed-model baseline for four tasks, on three
real-world datasets. The relative improvement is 45% and 66% for instance and
panoptic segmentation. Surprisingly, online TTT also outperforms its offline
variant that accesses more information, training on all frames from the entire
test video regardless of temporal order. This differs from previous findings
using synthetic videos. We conceptualize locality as the advantage of online
over offline TTT. We analyze the role of locality with ablations and a theory
based on bias-variance trade-off.
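The online TTT procedure described above can be sketched as a simple streaming loop. This is a minimal illustration, not the authors' implementation: the helper names (`ssl_step`, `predict`) and the scalar toy model in the usage example are hypothetical stand-ins for the self-supervised update (e.g., a masked-autoencoder reconstruction step) and the task head.

```python
from collections import deque

def online_ttt(frames, model_init, ssl_step, predict, window_size=3, steps=1):
    """Online test-time training over a stream of frames.

    For each incoming frame, the current model is initialized from the
    previous model (no reset), trained on the current frame plus a small
    window of immediately preceding frames via a self-supervised step,
    and only then used to predict on the current frame.

    ssl_step(model, window) -> updated model   (hypothetical signature)
    predict(model, frame)   -> prediction      (hypothetical signature)
    """
    model = model_init
    window = deque(maxlen=window_size)  # sliding window of recent frames
    predictions = []
    for frame in frames:
        window.append(frame)
        # Train on the local window before predicting on this frame.
        for _ in range(steps):
            model = ssl_step(model, list(window))
        predictions.append(predict(model, frame))
    return predictions

# Toy usage: a scalar "model" nudged toward the window mean,
# standing in for one gradient step on a reconstruction loss.
frames = [1.0, 2.0, 3.0]
ssl_step = lambda m, w: m + 0.5 * (sum(w) / len(w) - m)
predict = lambda m, f: m
preds = online_ttt(frames, 0.0, ssl_step, predict, window_size=2)
```

Because the window is small and recent, each update fits the local distribution of the stream; this locality is what the paper credits for online TTT beating its offline variant, which trains on all frames of the test video at once.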