Test-Time Training on Video Streams

Abstract

Prior work has established test-time training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is trained on the same instance using a self-supervised task, such as image reconstruction with masked autoencoders. We extend TTT to the streaming setting, where multiple test instances (video frames in our case) arrive in temporal order. Our extension is online TTT: the current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before it. Online TTT significantly outperforms the fixed-model baseline on four tasks across three real-world datasets. The relative improvements are 45% and 66% for instance and panoptic segmentation, respectively. Surprisingly, online TTT also outperforms its offline variant, which accesses more information by training on all frames from the entire test video regardless of temporal order. This differs from previous findings using synthetic videos. We conceptualize locality as the advantage of online over offline TTT, and analyze its role with ablations and a theory based on the bias-variance trade-off.
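The online update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy two-parameter linear predictor in place of a masked autoencoder, a hand-written masked-reconstruction loss, and a synthetic drifting stream. The names `masked_loss_and_grad`, `online_ttt`, and `make_stream`, and the parameters `window`, `steps`, and `lr`, are hypothetical. The sketch covers only the self-supervised update loop and omits the main-task prediction (for example, segmentation) that the updated model would perform on each frame.

```python
from typing import List, Tuple

Frame = List[float]
Params = Tuple[float, float]  # (scale, offset) of a toy linear predictor


def masked_loss_and_grad(params: Params, frame: Frame) -> Tuple[float, Params]:
    """Toy self-supervised task (an assumption, standing in for a masked
    autoencoder): hide the odd-indexed values of a frame and predict each
    one from its visible left neighbour as scale * visible + offset.
    Assumes the frame has at least two values."""
    a, b = params
    pairs = [(frame[i - 1], frame[i]) for i in range(1, len(frame), 2)]
    loss = grad_a = grad_b = 0.0
    for visible, hidden in pairs:
        err = a * visible + b - hidden
        loss += err * err
        grad_a += 2.0 * err * visible
        grad_b += 2.0 * err
    n = len(pairs)
    return loss / n, (grad_a / n, grad_b / n)


def online_ttt(frames: List[Frame], init_params: Params, window: int = 4,
               steps: int = 1, lr: float = 0.2) -> Tuple[Params, List[float]]:
    """Online TTT loop: the parameters carry over from frame to frame and
    are updated on the current frame plus the frames just before it."""
    params = init_params
    losses = []
    for t, frame in enumerate(frames):
        # Current frame and up to (window - 1) immediately preceding frames.
        recent = frames[max(0, t - window + 1): t + 1]
        for _ in range(steps):
            for f in recent:
                _, (ga, gb) = masked_loss_and_grad(params, f)
                params = (params[0] - lr * ga, params[1] - lr * gb)
        # Evaluate the updated model on the current frame.
        loss, _ = masked_loss_and_grad(params, frame)
        losses.append(loss)
    return params, losses


def make_stream(num_frames: int = 20) -> List[Frame]:
    """Synthetic stream (hypothetical data): each hidden value equals
    2 * visible + offset, and the offset drifts slowly over time."""
    frames = []
    for t in range(num_frames):
        offset = 0.1 * t
        frame: Frame = []
        for visible in (0.0, 0.5, 1.0):
            frame.extend([visible, 2.0 * visible + offset])
        frames.append(frame)
    return frames


INIT: Params = (2.0, 0.0)  # "pre-trained" parameters, exact for frame 0
frames = make_stream()
# Fixed-model baseline: the initial parameters are never updated.
fixed_losses = [masked_loss_and_grad(INIT, f)[0] for f in frames]
_, online_losses = online_ttt(frames, INIT)
```

On this toy stream the fixed model's reconstruction error grows as the offset drifts, while the online variant keeps tracking the most recent frames. This is only an analogy for the locality argument and does not reproduce any result from the paper.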
