VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Abstract

We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. We develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Our experiments show that, under a standard SFT→GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.