CyberV: Cybernetics for Test-time Scaling in Video Understanding
June 9, 2025
Authors: Jiahao Meng, Shuyang Sun, Yue Tan, Lu Qi, Yunhai Tong, Xiangtai Li, Longyin Wen
cs.AI
Abstract
Current Multimodal Large Language Models (MLLMs) may struggle with
understanding long or complex videos due to computational demands at test time,
lack of robustness, and limited accuracy, primarily stemming from their
feed-forward processing nature. These limitations can be even more severe for
models with fewer parameters. To address these issues, we propose a novel
framework inspired by cybernetic principles, redesigning video MLLMs as
adaptive systems capable of self-monitoring, self-correction, and dynamic
resource allocation during inference. Our approach, CyberV, introduces a
cybernetic loop consisting of an MLLM Inference System, a Sensor, and a
Controller. Specifically, the sensor monitors the forward process of the MLLM and
collects intermediate interpretations, such as attention drift; the controller
then determines when and how to trigger self-correction and generates feedback
to guide the next round of inference. This test-time adaptive scaling framework
enhances frozen MLLMs without requiring retraining or additional components.
Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B
by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive
proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0%
improvement, even achieving performance comparable to human experts.
Furthermore, our method demonstrates consistent gains on general-purpose
benchmarks, such as VideoMME and WorldSense, highlighting its effectiveness and
generalization capabilities in making MLLMs more robust and accurate for
dynamic video understanding. The code is released at
https://github.com/marinero4972/CyberV.
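
To make the sensor-controller loop described above concrete, the following is a minimal conceptual sketch in Python. It is not the released CyberV implementation: the names (Signals, run_mllm, sense, control, answer_with_test_time_loop) and the drift/confidence thresholds are hypothetical placeholders chosen for illustration; refer to the repository above for the actual code.

import random
from dataclasses import dataclass


@dataclass
class Signals:
    """Intermediate interpretations collected during the forward pass."""
    attention_drift: float     # how unevenly visual attention spreads over frames
    answer_confidence: float   # proxy for how certain the model is about its answer


def run_mllm(video_frames, question, feedback=None):
    """Stand-in for one forward pass of a frozen video MLLM.

    `feedback` would carry controller outputs from the previous round,
    e.g. key frames to revisit or an augmented prompt.
    """
    answer = "A"  # dummy answer for illustration
    trace = {"attention": [random.random() for _ in video_frames]}  # dummy attention trace
    return answer, trace


def sense(trace) -> Signals:
    """Sensor: summarize the forward trace into a few scalar signals."""
    att = trace["attention"]
    drift = max(att) - min(att) if att else 0.0
    return Signals(attention_drift=drift, answer_confidence=1.0 - drift)


def control(signals: Signals):
    """Controller: decide whether to trigger self-correction and what feedback to send back."""
    needs_correction = signals.attention_drift > 0.5 or signals.answer_confidence < 0.6
    feedback = {"revisit_frames": True} if needs_correction else None
    return needs_correction, feedback


def answer_with_test_time_loop(video_frames, question, max_rounds=3):
    """Cybernetic loop: infer, sense, control, and optionally re-infer with feedback."""
    feedback, answer = None, None
    for _ in range(max_rounds):
        answer, trace = run_mllm(video_frames, question, feedback)
        needs_correction, feedback = control(sense(trace))
        if not needs_correction:
            break  # controller accepts the answer; stop spending extra compute
    return answer


if __name__ == "__main__":
    print(answer_with_test_time_loop(list(range(16)), "What happens at the end of the video?"))

The point of the sketch is the control structure: the underlying MLLM stays frozen, and extra test-time compute is spent only when the sensor-derived signals indicate the current answer is unreliable.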