CyberV: Cybernetics for Test-time Scaling in Video Understanding

Abstract

Current Multimodal Large Language Models (MLLMs) may struggle to understand long or complex videos because of computational demands at test time, a lack of robustness, and limited accuracy, all of which stem primarily from their feed-forward processing. These limitations could be more severe for models with fewer parameters. To address them, we propose a novel framework inspired by cybernetic principles that redesigns video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the Sensor monitors the forward processes of the MLLM and collects intermediate interpretations, such as attention drift; the Controller then determines when and how to trigger self-correction and generates feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance comparable even to human experts. Furthermore, our method shows consistent gains on general-purpose benchmarks such as VideoMME and WorldSense, highlighting its effectiveness and generalization in making MLLMs more robust and accurate for dynamic video understanding. The code is released at https://github.com/marinero4972/CyberV.
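The sensor/controller loop described in the abstract can be pictured with a small toy sketch. This is not the released CyberV code. The `Signals` fields, the thresholds `conf_min` and `drift_max`, the feedback string, and `toy_model` are all assumptions introduced here for illustration; only the overall shape (run inference, observe intermediate signals, decide whether to trigger another round with feedback) follows the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Signals:
    """What a hypothetical sensor reads from one forward pass."""
    answer: str
    confidence: float       # assumed to lie in [0, 1]
    attention_drift: float  # assumed: larger means attention left the video


# Hypothetical frozen model: (question, optional feedback) -> observed signals.
Model = Callable[[str, Optional[str]], Signals]


def controller(signals: Signals, conf_min: float, drift_max: float) -> Optional[str]:
    """Return feedback for another round, or None to accept the answer."""
    if signals.confidence >= conf_min and signals.attention_drift <= drift_max:
        return None
    # Placeholder feedback; the real feedback format is not given in the abstract.
    return "Re-inspect the video frames most relevant to the question."


def cybernetic_loop(
    model: Model,
    question: str,
    max_rounds: int = 3,
    conf_min: float = 0.7,
    drift_max: float = 0.5,
) -> List[Signals]:
    """Run inference, sense, and control until accepted or out of rounds."""
    history: List[Signals] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        signals = model(question, feedback)  # inference system (weights untouched)
        history.append(signals)              # sensor: record the observation
        feedback = controller(signals, conf_min, drift_max)
        if feedback is None:                 # controller accepts the answer
            break
    return history


def toy_model(question: str, feedback: Optional[str]) -> Signals:
    """Stand-in model: unsure on the first pass, confident once given feedback."""
    if feedback is None:
        return Signals(answer="B", confidence=0.4, attention_drift=0.8)
    return Signals(answer="C", confidence=0.9, attention_drift=0.2)


if __name__ == "__main__":
    rounds = cybernetic_loop(toy_model, "What does the lecturer derive?")
    print(len(rounds), rounds[-1].answer)
```

In this sketch extra compute is spent only when the controller rejects a round, which is one simple way to read "dynamic resource allocation"; the model itself is never retrained.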

