

HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

March 19, 2026
Authors: Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
cs.AI

Abstract

Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
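As a rough illustration of the bottom-up fuzzy-logic composition described above, a per-frame conjunction of two normalized expert signals can be taken as an element-wise minimum, and an "A then B" sequencing constraint as gating B's score by the best A-score seen at any strictly earlier frame. This is a minimal sketch under those assumptions, not the paper's implementation; all function and variable names are hypothetical:

```python
import numpy as np

def fuzzy_and(a, b):
    # Per-frame conjunction of two satisfaction signals in [0, 1] (Godel t-norm)
    return np.minimum(a, b)

def fuzzy_then(a, b):
    # "A then B": B's score at frame t is gated by the best A-score at any
    # strictly earlier frame, enforcing temporal sequencing of the sub-events.
    best_a_so_far = np.maximum.accumulate(a)
    shifted = np.concatenate(([0.0], best_a_so_far[:-1]))
    return np.minimum(shifted, b)

# Toy per-frame predicate scores (e.g., from a CLIP and a CLAP expert),
# already normalized to [0, 1] and temporally smoothed.
dog_visible = np.array([0.1, 0.9, 0.8, 0.2, 0.1])
bark_heard  = np.array([0.0, 0.1, 0.2, 0.9, 0.3])

# Satisfaction curve for "a dog appears, then a bark is heard"
curve = fuzzy_then(dog_visible, bark_heard)

# Select the highest-satisfaction frames for the LVLM's context window
top_frames = np.argsort(curve)[::-1][:2]
```

Here the curve peaks at the frame where the bark occurs after the dog has been seen, so frame selection concentrates the limited context budget on the temporally consistent segment rather than on each predicate's peak in isolation.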
March 24, 2026