

LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

February 2, 2026
作者: Zhenpeng Huang, Jiaqi Li, Zihan Jia, Xinhao Li, Desen Meng, Lingxue Song, Xi Chen, Liang Li, Limin Wang
cs.AI

Abstract

We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.
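To make the Stage 1 objective concrete, below is a minimal sketch of a DPO-style preference loss in which the reference model is scored on the anchor clip only, rather than on the full interleaved long video, as the abstract describes. All function names, tensor shapes, and the beta value are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the Stage-1 preference objective (assumptions, not the paper's code):
# the policy scores responses against the full interleaved long video, while the
# reference model's log-probabilities are approximated using the anchor clip alone.
import torch
import torch.nn.functional as F


def dpo_loss_with_anchor_reference(
    policy_logp_chosen: torch.Tensor,       # log p_theta(y_w | interleaved long video, q)
    policy_logp_rejected: torch.Tensor,     # log p_theta(y_l | interleaved long video, q)
    ref_logp_chosen_anchor: torch.Tensor,   # log p_ref(y_w | anchor clip only, q)
    ref_logp_rejected_anchor: torch.Tensor, # log p_ref(y_l | anchor clip only, q)
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO loss, with reference log-probs approximated on the anchor clip."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen_anchor
    rejected_margin = policy_logp_rejected - ref_logp_rejected_anchor
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()


if __name__ == "__main__":
    # Toy batch of 4 preference pairs with made-up sequence log-probabilities.
    lp_c = torch.tensor([-12.0, -10.5, -11.2, -9.8])
    lp_r = torch.tensor([-14.0, -13.1, -12.9, -12.4])
    ref_c = torch.tensor([-12.5, -11.0, -11.0, -10.1])
    ref_r = torch.tensor([-13.0, -12.0, -12.2, -11.5])
    print(dpo_loss_with_anchor_reference(lp_c, lp_r, ref_c, ref_r))
```

Because the reference term is conditioned only on the short anchor clip, the expensive long-context forward pass is needed just for the policy model, which is the computational saving the abstract refers to.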