HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
March 19, 2026
Authors: Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
cs.AI
Abstract
Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive computational cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align the different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench, and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
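The fuzzy-logic composition step can be illustrated with a minimal sketch. The function names, the moving-average smoothing window, and the synthetic Gaussian expert signals below are all illustrative assumptions; the paper's exact operators and experts are not reproduced here. The sketch shows how two per-frame predicate scores can be combined under a "A, then B" temporal-sequencing operator to yield a satisfaction curve for frame selection:

```python
import numpy as np

def smooth(signal, window=5):
    """Temporally smooth a per-frame score with a moving average."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

def fuzzy_and(a, b):
    """Fuzzy conjunction: both predicates hold in the same frame."""
    return np.minimum(a, b)

def fuzzy_before(a, b):
    """Fuzzy 'A then B': B's score is gated by the best A-score so far,
    so the curve only rises once A has plausibly occurred."""
    return np.minimum(np.maximum.accumulate(a), b)

# Toy per-frame scores in [0, 1] standing in for two expert modules
# (e.g. a CLIP-based visual predicate and an ASR-based audio predicate).
t = np.linspace(0, 1, 200)
pred_a = smooth(np.exp(-((t - 0.3) ** 2) / 0.005))  # event A near frame 60
pred_b = smooth(np.exp(-((t - 0.7) ** 2) / 0.005))  # event B near frame 140

# Satisfaction curve for "A happens, then B": peaks where B follows A.
curve = fuzzy_before(pred_a, pred_b)
top_frames = np.sort(np.argsort(curve)[-16:])  # keep the 16 best frames
```

Because the operators act on continuous scores rather than hard detections, the curve degrades gracefully when an expert is only weakly confident, and the top-k frames can be handed directly to the downstream LVLM.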