Small Vision-Language Models are Smart Compressors for Long Video Understanding

Abstract

Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, such as sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process that generates compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show that our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
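To make the budgeting idea concrete, the following is a minimal sketch of relevance-weighted token allocation under a fixed visual budget. It is not the ATA router described in the abstract: it is a simple offline proportional allocation with a per-frame floor and cap, whereas ATA is described as causal and O(1). The function name `allocate_tokens`, its parameters, and the assumption that per-segment relevance scores are already available (for example, from the SVLM's zero-shot relevance prior) are all hypothetical. Only the 0.5 and 16 tokens-per-frame bounds are taken from the abstract.

```python
import math
from typing import List


def allocate_tokens(
    relevance: List[float],
    frames: List[int],
    budget: int,
    min_tpf: float = 0.5,   # lower bound in tokens per frame (from the abstract)
    max_tpf: float = 16.0,  # upper bound in tokens per frame (from the abstract)
) -> List[int]:
    """Split a visual-token budget across video segments by relevance.

    Hypothetical helper, not the paper's ATA. Every segment first receives a
    floor (a minimal "temporal anchor"); the leftover budget is then shared in
    proportion to relevance * frame count, capped per segment.
    """
    if len(relevance) != len(frames):
        raise ValueError("relevance and frames must have the same length")

    floor = [min_tpf * f for f in frames]
    cap = [max_tpf * f for f in frames]
    if sum(floor) > budget:
        raise ValueError("budget is too small to give every segment its floor")

    alloc = floor[:]
    remaining = budget - sum(floor)
    active = [i for i in range(len(frames)) if cap[i] > alloc[i]]

    # Water-filling: share the leftover by weight, clip at the cap, and
    # redistribute whatever the capped segments could not absorb.
    while remaining > 1e-9 and active:
        weights = [relevance[i] * frames[i] for i in active]
        total = sum(weights)
        if total <= 0:
            break  # no relevant segment left; the rest of the budget stays unused
        next_active = []
        spent = 0.0
        for i, w in zip(active, weights):
            share = remaining * w / total
            room = cap[i] - alloc[i]
            give = min(share, room)
            alloc[i] += give
            spent += give
            if room - give > 1e-9:
                next_active.append(i)
        remaining -= spent
        if len(next_active) == len(active):
            break  # nobody hit the cap, so the leftover is fully distributed
        active = next_active

    # Round down so the integer total never exceeds the budget.
    return [math.floor(a + 1e-9) for a in alloc]
```

Calling `allocate_tokens([0.1, 0.9, 0.0, 0.4], [100, 100, 100, 100], budget=2000)` gives the zero-relevance segment only its 50-token floor and sends most of the remaining budget to the segment with the highest score. The floor plays the role of the minimal temporal anchors mentioned in the abstract, so low-relevance spans are compressed rather than dropped.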
