
Building a Precise Video Language with Human-AI Oversight

April 22, 2026
Authors: Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan
cs.AI

Abstract

Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
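The CHAI workflow pairs each model-generated pre-caption with expert critiques and a revised post-caption, and the abstract notes these pairs supply preference data for DPO. A minimal sketch of that data flow is below; the record schema, field names, and prompt format are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class CaptionRecord:
    """One CHAI-style annotation: a model draft, expert critiques,
    and the expert-revised final caption. Hypothetical schema."""
    video_id: str
    pre_caption: str       # model-generated draft caption
    critiques: list[str]   # expert-written error notes on the draft
    post_caption: str      # expert-revised final caption


def to_dpo_pair(rec: CaptionRecord) -> dict:
    """Turn a pre/post caption pair into a DPO preference example:
    the revised post-caption is preferred over the model draft."""
    return {
        "prompt": f"Describe video {rec.video_id} precisely.",
        "chosen": rec.post_caption,
        "rejected": rec.pre_caption,
    }


rec = CaptionRecord(
    video_id="clip_001",
    pre_caption="A person walks through a room.",
    critiques=[
        "Missing camera motion: slow dolly-in.",
        "Subject underspecified: a woman in a red coat.",
    ],
    post_caption=(
        "A woman in a red coat walks through a dim room "
        "as the camera slowly dollies in."
    ),
)
pair = to_dpo_pair(rec)
print(pair["rejected"])
print(pair["chosen"])
```

The division of labor in the abstract maps directly onto this record: the model fills `pre_caption`, and the human role narrows to writing `critiques` and signing off on `post_caption`.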
April 28, 2026
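The ablations report that critique quality, measured by precision, recall, and constructiveness, governs downstream performance. A simple way to score a model critique against expert-annotated error points is set-based precision/recall, sketched below; the error-point labels and scoring rule are assumptions for illustration (constructiveness is not modeled here).

```python
def critique_precision_recall(predicted: set[str],
                              gold: set[str]) -> tuple[float, float]:
    """Score a model critique against expert error annotations.
    Precision: fraction of raised points that are real errors.
    Recall: fraction of real errors that were raised."""
    if not predicted or not gold:
        return 0.0, 0.0
    tp = len(predicted & gold)  # correctly identified error points
    return tp / len(predicted), tp / len(gold)


# Hypothetical error-point labels for one caption.
pred = {"wrong_camera_motion", "missing_subject_detail", "extra_object"}
gold = {"wrong_camera_motion", "missing_subject_detail", "wrong_scene"}
p, r = critique_precision_recall(pred, gold)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```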