Building a Precise Video Language with Human-AI Oversight

Abstract

Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded in hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework in which trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to focus on verification. Additionally, these critiques and the preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography, including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
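
As a rough illustration of how one critique-and-revise annotation could feed the three training signals named above (caption generation, critique generation, and preference learning), the following minimal Python sketch may help. Everything in it is an assumption made for illustration: the `ChaiRecord` schema, its field names, the helper functions, and the sample clip are hypothetical and do not describe the released data format or the authors' training code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChaiRecord:
    """One annotated video (hypothetical schema, for illustration only)."""
    video_id: str
    pre_caption: str   # model-generated draft caption
    critique: str      # expert feedback on the draft
    post_caption: str  # expert-revised caption


def to_sft_example(rec: ChaiRecord) -> dict:
    """Caption generation: supervise on the revised caption."""
    return {"video": rec.video_id, "target": rec.post_caption}


def to_critique_example(rec: ChaiRecord) -> dict:
    """Critique generation: given the draft, supervise on the expert critique."""
    return {"video": rec.video_id, "caption": rec.pre_caption, "target": rec.critique}


def to_dpo_pair(rec: ChaiRecord) -> Optional[dict]:
    """Preference pair: the post-caption is preferred over the pre-caption.

    Returns None when the expert left the draft unchanged, because an
    identical pair carries no preference signal.
    """
    if rec.pre_caption.strip() == rec.post_caption.strip():
        return None
    return {
        "video": rec.video_id,
        "chosen": rec.post_caption,
        "rejected": rec.pre_caption,
    }


# Made-up record, not taken from the dataset.
rec = ChaiRecord(
    video_id="clip_0001",
    pre_caption="The camera pans left across the street.",
    critique="The camera physically moves sideways to the right; this is a truck, not a pan.",
    post_caption="The camera trucks right along the street.",
)
pair = to_dpo_pair(rec)
```

Under this sketch a single expert pass yields a supervised caption target, a critique target, and a chosen/rejected pair, which is one plausible reading of why the abstract describes the critiques and preferences as rich supervision.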
