Building a Precise Video Language with Human-AI Oversight
April 22, 2026
Authors: Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan
cs.AI
Abstract
Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
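The pre-/post-caption pairs produced under CHAI map naturally onto preference data for DPO: the expert-revised post-caption is preferred over the model's pre-caption. A minimal sketch of that conversion, assuming illustrative field names (`video_id`, `pre_caption`, `post_caption`, `critique`) rather than the released data schema:

```python
# Convert CHAI-style critique records into DPO preference pairs.
# The expert-revised post-caption is "chosen"; the model's pre-caption is "rejected".
# Field names are illustrative assumptions, not the paper's released schema.

def to_dpo_pairs(records):
    """Build (prompt, chosen, rejected) triples for DPO training.

    Records whose critique required no revision yield no pair, since
    identical pre- and post-captions carry no preference signal.
    """
    pairs = []
    for rec in records:
        if rec["pre_caption"] == rec["post_caption"]:
            continue  # nothing was revised, so there is no preference
        pairs.append({
            "prompt": f"Describe this video precisely: {rec['video_id']}",
            "chosen": rec["post_caption"],
            "rejected": rec["pre_caption"],
        })
    return pairs

records = [
    {"video_id": "clip_001",
     "pre_caption": "A camera pans over a city.",
     "post_caption": "A slow left-to-right pan over a city skyline at dusk.",
     "critique": "Pre-caption omits pan direction, speed, and time of day."},
    {"video_id": "clip_002",
     "pre_caption": "A dog runs across a lawn.",
     "post_caption": "A dog runs across a lawn.",
     "critique": "No issues found."},
]

pairs = to_dpo_pairs(records)
# Only clip_001 yields a pair; clip_002 needed no revision.
```

The same records also supply supervision for the other two training targets the abstract lists: the critique text serves as an SFT target for critique generation, and the chosen/rejected pairs can train a reward model for inference-time scaling.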