

HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

March 19, 2026
Authors: Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
cs.AI

Abstract

Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
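As a rough illustration of the bottom-up fuzzy-logic composition described above, a per-frame conjunction of two normalized expert signals can be taken as an element-wise minimum, and an "A then B" sequencing constraint as gating B's score by the best A-score seen at any strictly earlier frame. This is a minimal sketch under those assumptions, not the paper's implementation; all function and variable names are hypothetical:

```python
import numpy as np

def fuzzy_and(a, b):
    # Per-frame conjunction of two satisfaction signals in [0, 1] (Godel t-norm)
    return np.minimum(a, b)

def fuzzy_then(a, b):
    # "A then B": B's score at frame t is gated by the best A-score at any
    # strictly earlier frame, enforcing temporal sequencing of the sub-events.
    best_a_so_far = np.maximum.accumulate(a)
    shifted = np.concatenate(([0.0], best_a_so_far[:-1]))
    return np.minimum(shifted, b)

# Toy per-frame predicate scores (e.g., from a CLIP and a CLAP expert),
# already normalized to [0, 1] and temporally smoothed.
dog_visible = np.array([0.1, 0.9, 0.8, 0.2, 0.1])
bark_heard  = np.array([0.0, 0.1, 0.2, 0.9, 0.3])

# Satisfaction curve for "a dog appears, then a bark is heard"
curve = fuzzy_then(dog_visible, bark_heard)

# Select the highest-satisfaction frames for the LVLM's context window
top_frames = np.argsort(curve)[::-1][:2]
```

Here the curve peaks at the frame where the bark occurs after the dog has been seen, so frame selection concentrates the limited context budget on the temporally consistent segment rather than on each predicate's peak in isolation.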
March 24, 2026