VLMs Are Effective Teachers for Video Reasoning via Adaptive Test-Time Optimization

Abstract

The recent "Reasoning with Video" paradigm uses Video Generation Models (VGMs) to complete reasoning tasks by generating temporally coherent visual trajectories. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, which leads to logical failures across diverse reasoning scenarios. Existing approaches use Vision-Language Models (VLMs) as problem pre-solvers that produce or refine textual guidance for the VGM. However, textual descriptions fail to capture intricate spatiotemporal details, and VGMs often cannot faithfully execute fine-grained or long-tail instructions even when given a valid plan. Although VLMs are limited as solvers, they have strong perception capabilities for evaluating whether process constraints are satisfied and whether the final goal is achieved. Leveraging this strength, we introduce a paradigm shift that recasts VLMs as "teachers". Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, which guide a VGM reasoner through test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings indicate that integrating VLMs as test-time teachers is a promising paradigm for generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/
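To make the idea of test-time optimization against a differentiable reward concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a frozen 2x2 linear map as a stand-in for the VGM, a rank-1 adapter as a stand-in for the LoRA module, and a hand-written quadratic reward as a stand-in for a rule extracted by the VLM teacher. All names (`forward`, `reward`, `optimize_adapter`) and all numeric values are hypothetical and chosen only for illustration.

```python
# Toy illustration only: a frozen "generator" W plus a rank-1 adapter (B A),
# where only the adapter is updated at test time by gradient ascent on a reward.

def forward(W, A, B, x):
    """Return y = W x + B (A x) and the scalar h = A x (rank-1 adapter)."""
    h = sum(a * xi for a, xi in zip(A, x))
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return [bo + b * h for bo, b in zip(base, B)], h


def reward(y, target):
    """Assumed differentiable reward: negative squared distance to a goal."""
    return -sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def optimize_adapter(W, x, target, steps=200, lr=0.05):
    """Gradient ascent on the reward; W is read but never modified."""
    A = [0.5, 0.5]  # down-projection, small nonzero init
    B = [0.0, 0.0]  # up-projection, zero init so the start matches the base model
    for _ in range(steps):
        y, h = forward(W, A, B, x)
        err = [yi - ti for yi, ti in zip(y, target)]
        s = sum(e * b for e, b in zip(err, B))
        grad_B = [-2.0 * e * h for e in err]   # d(reward)/dB
        grad_A = [-2.0 * s * xi for xi in x]   # d(reward)/dA
        B = [b + lr * g for b, g in zip(B, grad_B)]
        A = [a + lr * g for a, g in zip(A, grad_A)]
    return A, B


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights
x = [1.0, 2.0]                 # toy input
target = [2.0, 1.0]            # toy goal encoded by the reward

initial_reward = reward(forward(W, [0.5, 0.5], [0.0, 0.0], x)[0], target)
A, B = optimize_adapter(W, x, target)
final_reward = reward(forward(W, A, B, x)[0], target)
print(initial_reward, final_reward)
```

In this sketch the base weights stay fixed and only the low-rank factors change, which mirrors the lightweight per-instance adaptation the abstract describes. In contrast to Best-of-N sampling, the reward here supplies a gradient signal rather than only ranking finished samples.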