Video2LoRA: Parametric Video Internalization for Vision-Language Models
June 3, 2026
Authors: Manan Suri, Sarvesh Baskar, Dinesh Manocha
cs.AI
Abstract
Processing video in vision-language models (VLMs) is expensive: each frame occupies hundreds of tokens, and inference cost grows with every frame and every repeated query. We introduce Video2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer by layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Video2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Video2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Video2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1,024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query time to first token (TTFT) by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.
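The abstract does not say which operation "compose in rank space" refers to. One plausible reading, assumed here and not confirmed by the source, is concatenating the low-rank factors of per-segment adapters along the rank axis. For LoRA factors A (rank x d_in) and B (d_out x rank), this concatenation yields a single adapter whose weight update equals the sum of the per-segment updates. The sketch below checks that identity with small integer matrices; the dimensions, the number of segments, and all function names are hypothetical and chosen only for illustration.

```python
import random
from functools import reduce


def matmul(X, Y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def rand_matrix(rng, rows, cols):
    """Small integer entries keep the arithmetic exact."""
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]


def compose_in_rank_space(adapters):
    """Concatenate LoRA factors along the rank axis (assumed composition rule).

    adapters: list of (A, B) pairs, A of shape (r_i, d_in), B of shape (d_out, r_i).
    Returns A_cat of shape (sum r_i, d_in) and B_cat of shape (d_out, sum r_i).
    """
    d_out = len(adapters[0][1])
    A_cat = [row for A, _ in adapters for row in A]  # stack rows of every A
    B_cat = [[v for _, B in adapters for v in B[i]] for i in range(d_out)]  # join columns
    return A_cat, B_cat


# Hypothetical sizes: three video segments, each with a rank-2 adapter.
rng = random.Random(0)
d_in, d_out, rank, num_segments = 6, 5, 2, 3
segment_adapters = [
    (rand_matrix(rng, rank, d_in), rand_matrix(rng, d_out, rank))
    for _ in range(num_segments)
]

A_cat, B_cat = compose_in_rank_space(segment_adapters)

# Weight update of the composed adapter versus the sum of per-segment updates.
composed_delta = matmul(B_cat, A_cat)
sum_of_deltas = reduce(add, (matmul(B, A) for A, B in segment_adapters))

print(len(A_cat), len(A_cat[0]), len(B_cat), len(B_cat[0]))
print(composed_delta == sum_of_deltas)
```

Under this assumption the composed adapter has rank at most the sum of the segment ranks, so the cost of chunked internalization would grow with the number of segments. Whether Video2LoRA composes adapters this way, or applies any rescaling, is not stated in the abstract.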