
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

February 2, 2026
作者: Zhenpeng Huang, Jiaqi Li, Zihan Jia, Xinhao Li, Desen Meng, Lingxue Song, Xi Chen, Liang Li, Limin Wang
cs.AI

Abstract

We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.
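As a rough illustration of the anchor-clip approximation described above, the sketch below shows a standard DPO loss in which the frozen reference model's log-probabilities are computed from the anchor clip alone rather than from the full long context. This is a minimal sketch assuming per-example log-probabilities are already available; the function and argument names are hypothetical and are not taken from the authors' implementation.

```python
# Hedged sketch (not the authors' code): standard DPO loss where the reference
# log-probs come from scoring the anchor clip only, approximating the reference
# model's score over the full long context.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,    # log p_theta(chosen | long interleaved context)
             policy_logp_rejected: torch.Tensor,  # log p_theta(rejected | long interleaved context)
             ref_logp_chosen: torch.Tensor,       # log p_ref(chosen | anchor clip only)
             ref_logp_rejected: torch.Tensor,     # log p_ref(rejected | anchor clip only)
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss with anchor-only reference scoring."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Example usage with dummy per-example log-probabilities:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
```

The design point is that the chosen and rejected responses are conditioned on the anchor clip's content, so scoring the reference model on the anchor alone avoids a second forward pass over the full long context while preserving a meaningful preference margin.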