Video2LoRA: Parametric Video Internalization for Vision-Language Models

Abstract

Processing video in vision-language models (VLMs) is expensive: each frame occupies hundreds of tokens, and inference cost grows with every frame and every repeated query. We introduce Video2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer by layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Video2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Video2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Video2LoRA is statistically non-inferior and equivalent to direct video-in-context inference on all five captioning benchmarks at both model scales, and on seven of eight benchmark-scale pairings for video question answering. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1,024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query time-to-first-token (TTFT) by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.
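The abstract does not spell out how adapters "compose in rank space". One natural reading, offered here only as an assumption, is concatenation along the rank axis: given per-segment factors (B_i, A_i), stacking the B_i column-wise and the A_i row-wise yields a single adapter whose weight update equals the sum of the per-segment updates B_i A_i. The minimal sketch below checks that identity on random matrices; the helper names (`compose_in_rank_space`, `rand_matrix`) and the dimensions are hypothetical and are not taken from the paper.

```python
import random

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def rand_matrix(rows, cols, rng):
    """Random matrix standing in for hypernetwork-predicted LoRA factors."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def compose_in_rank_space(adapters):
    """Concatenate (B, A) pairs along the rank axis (assumed composition rule).

    Each B_i is d_out x r_i and each A_i is r_i x d_in. The result is
    B of shape d_out x sum(r_i) and A of shape sum(r_i) x d_in.
    """
    d_out = len(adapters[0][0])
    b_cat = [sum((b[i] for b, _ in adapters), []) for i in range(d_out)]
    a_cat = [row for _, a in adapters for row in a]
    return b_cat, a_cat

rng = random.Random(0)
d_out, d_in, rank = 6, 5, 2  # illustrative sizes only

# Two adapters, as if generated independently for two non-overlapping segments.
seg1 = (rand_matrix(d_out, rank, rng), rand_matrix(rank, d_in, rng))
seg2 = (rand_matrix(d_out, rank, rng), rand_matrix(rank, d_in, rng))

b_cat, a_cat = compose_in_rank_space([seg1, seg2])
delta_composed = matmul(b_cat, a_cat)                     # one rank-4 update
delta_summed = add(matmul(*seg1), matmul(*seg2))          # sum of two rank-2 updates

# Largest element-wise difference between the two weight updates.
max_gap = max(
    abs(p - q)
    for row_p, row_q in zip(delta_composed, delta_summed)
    for p, q in zip(row_p, row_q)
)
```

Under this reading, the composed adapter's rank grows linearly with the number of segments, which is the cost that a chunked long-video scheme of this kind would have to manage.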