Building a Precise Video Language with Human-AI Oversight

Abstract

Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, and spatial and camera dynamics, grounded in hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework in which trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency: text generation is offloaded to models, so that humans can focus on verification. Additionally, the critiques and the preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality (precision, recall, and constructiveness), which our oversight framework ensures, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography, including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
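
As a rough illustration of how one expert-reviewed annotation could supply the three kinds of supervision mentioned in the abstract (caption generation via SFT, preferences for DPO or reward modeling, and critique generation), the Python sketch below maps a single record to three training examples. The record layout, field names, and function names are hypothetical and are not taken from the released data or code; the prompt/chosen/rejected shape simply follows the common convention for preference pairs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChaiRecord:
    """Hypothetical layout for one annotation; all field names are assumptions."""
    video_id: str
    pre_caption: str    # model-generated draft caption
    critique: str       # expert critique of the draft
    post_caption: str   # expert-revised caption


def to_sft_example(rec: ChaiRecord) -> dict:
    # Supervised target for caption generation: the revised post-caption.
    return {"video": rec.video_id, "target": rec.post_caption}


def to_preference_pair(rec: ChaiRecord) -> Optional[dict]:
    # Preference signal: the post-caption is preferred over the pre-caption.
    # If the expert left the draft unchanged there is no preference to learn.
    if rec.pre_caption.strip() == rec.post_caption.strip():
        return None
    return {
        "video": rec.video_id,
        "chosen": rec.post_caption,
        "rejected": rec.pre_caption,
    }


def to_critique_example(rec: ChaiRecord) -> dict:
    # Supervised target for critique generation: given the draft, produce the critique.
    return {"video": rec.video_id, "caption": rec.pre_caption, "target": rec.critique}


if __name__ == "__main__":
    rec = ChaiRecord(
        video_id="clip_0001",
        pre_caption="The camera pans left across a street.",
        critique="The camera trucks left; it does not rotate, so this is not a pan.",
        post_caption="The camera trucks left along a street.",
    )
    print(to_sft_example(rec))
    print(to_preference_pair(rec))
    print(to_critique_example(rec))
```

Dropping unchanged records from the preference set is a design choice made here for illustration only; the abstract does not say how such cases are handled.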
