Video2LoRA: Parametric Video Internalization for Vision-Language Models

Abstract

Processing video in vision-language models (VLMs) is expensive: each frame occupies hundreds of tokens, and inference cost grows with every frame and every repeated query. We introduce Video2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer by layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Video2LoRA predicts the adapter weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Video2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Video2LoRA is statistically non-inferior and equivalent to direct video-in-context inference on all five captioning benchmarks at both model scales, and on seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1,024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query time to first token (TTFT) by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can be composed in rank space, suggesting a path toward chunked long-video internalization.
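
The abstract does not spell out what composing adapters in rank space involves. One plausible reading, assumed here for illustration only, is that the per-segment LoRA factors are concatenated along the rank dimension, which is algebraically the same as summing the per-segment weight updates: B1·A1 + B2·A2 = [B1 B2]·[A1; A2]. The minimal sketch below checks that identity on random matrices; the function names, shapes, and ranks are hypothetical and are not taken from Video2LoRA.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def compose_rank_space(adapters):
    """Concatenate (B, A) LoRA factor pairs along the rank dimension.

    Each B has shape (d_out, r_i) and each A has shape (r_i, d_in).
    Returns B_cat of shape (d_out, sum r_i) and A_cat of shape (sum r_i, d_in).
    This is an assumed composition rule, not the paper's confirmed method.
    """
    d_out = len(adapters[0][0])
    # Join the i-th row of every B side by side (column-wise concatenation).
    b_cat = [sum((b[i] for b, _ in adapters), []) for i in range(d_out)]
    # Stack the rows of every A on top of each other (row-wise concatenation).
    a_cat = [row for _, a in adapters for row in a]
    return b_cat, a_cat


rng = random.Random(0)
d_out, d_in, r = 6, 5, 2  # hypothetical layer and rank sizes

# Two adapters, standing in for two non-overlapping video segments.
seg1 = (rand_matrix(d_out, r, rng), rand_matrix(r, d_in, rng))
seg2 = (rand_matrix(d_out, r, rng), rand_matrix(r, d_in, rng))

b_cat, a_cat = compose_rank_space([seg1, seg2])
delta_composed = matmul(b_cat, a_cat)
delta_summed = add(matmul(*seg1), matmul(*seg2))

# Largest absolute difference between the two weight updates.
max_err = max(
    abs(x - y)
    for row_c, row_s in zip(delta_composed, delta_summed)
    for x, y in zip(row_c, row_s)
)
```

Under this assumption, the composed adapter has rank at most the sum of the segment ranks, so its size grows linearly with the number of chunks. Whether the paper uses concatenation, summation, or another rule is not stated in the abstract.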