LiteFrame: An Efficient Vision Encoder Enabling Frame Scaling in Video LLMs

Abstract

The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video is managing the explosion in visual-token context length. Existing strategies predominantly rely on "post-hoc" token reduction: they reduce visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict the information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach yields a new latency-accuracy Pareto frontier: compared with InternVL3-8B, LiteFrame reduces end-to-end latency by 35% while processing 8× more frames, and it improves average video understanding accuracy across multiple benchmarks. Our results point to a new potential path toward longer-form video understanding under fixed compute budgets.
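
The abstract describes CTD only at a high level, so the following is a minimal sketch of the general idea and not the authors' implementation. It assumes that the teacher's compressed representation can be approximated by average pooling over small windows of frames and tokens, and that the student is trained with a mean-squared-error objective. Both choices, along with the names `compress_teacher_tokens`, `distillation_loss`, `t_stride`, and `s_stride`, are hypothetical and introduced here purely for illustration.

```python
from typing import List

Token = List[float]   # one D-dimensional visual token
Frame = List[Token]   # N tokens per frame
Video = List[Frame]   # T frames


def compress_teacher_tokens(video: Video, t_stride: int, s_stride: int) -> List[Token]:
    """Average-pool teacher tokens over windows of t_stride frames and s_stride tokens.

    ASSUMPTION: average pooling is a stand-in for whatever spatio-temporal
    compression the teacher side actually uses; the source does not specify it.
    """
    num_frames, num_tokens, dim = len(video), len(video[0]), len(video[0][0])
    compressed: List[Token] = []
    for t0 in range(0, num_frames, t_stride):
        for s0 in range(0, num_tokens, s_stride):
            window = [
                video[t][s]
                for t in range(t0, min(t0 + t_stride, num_frames))
                for s in range(s0, min(s0 + s_stride, num_tokens))
            ]
            compressed.append(
                [sum(tok[d] for tok in window) / len(window) for d in range(dim)]
            )
    return compressed


def distillation_loss(student_tokens: List[Token], target_tokens: List[Token]) -> float:
    """Mean squared error between student predictions and compressed teacher targets.

    ASSUMPTION: MSE is used here only as a simple, common distillation objective.
    """
    if len(student_tokens) != len(target_tokens):
        raise ValueError("student must predict one token per compressed teacher token")
    total, count = 0.0, 0
    for pred, target in zip(student_tokens, target_tokens):
        for p, q in zip(pred, target):
            total += (p - q) ** 2
            count += 1
    return total / count


# Toy teacher output: 4 frames x 4 tokens x 2 dims (16 tokens in total).
video = [[[float(t), float(s)] for s in range(4)] for t in range(4)]

# Compress 2 frames x 2 tokens into one target token: 16 tokens become 4.
targets = compress_teacher_tokens(video, t_stride=2, s_stride=2)

# An untrained "student" that predicts all zeros, one token per target.
student = [[0.0, 0.0] for _ in targets]
loss = distillation_loss(student, targets)
```

In this toy setup the student is asked to emit 4 tokens directly instead of the teacher's 16, which is the sense in which predicting the compressed representation can skip redundant per-frame computation. A real student would be a compact vision encoder trained by gradient descent on such a loss.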