HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
March 19, 2026
Authors: Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
cs.AI
Abstract
Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive computational cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align the different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench, and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
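The fuzzy-logic composition step can be illustrated with a minimal sketch. The function names, the moving-average smoothing window, and the synthetic Gaussian expert signals below are all illustrative assumptions; the paper's exact operators and experts are not reproduced here. The sketch shows how two per-frame predicate scores can be combined under a "A, then B" temporal-sequencing operator to yield a satisfaction curve for frame selection:

```python
import numpy as np

def smooth(signal, window=5):
    """Temporally smooth a per-frame score with a moving average."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

def fuzzy_and(a, b):
    """Fuzzy conjunction: both predicates hold in the same frame."""
    return np.minimum(a, b)

def fuzzy_before(a, b):
    """Fuzzy 'A then B': B's score is gated by the best A-score so far,
    so the curve only rises once A has plausibly occurred."""
    return np.minimum(np.maximum.accumulate(a), b)

# Toy per-frame scores in [0, 1] standing in for two expert modules
# (e.g. a CLIP-based visual predicate and an ASR-based audio predicate).
t = np.linspace(0, 1, 200)
pred_a = smooth(np.exp(-((t - 0.3) ** 2) / 0.005))  # event A near frame 60
pred_b = smooth(np.exp(-((t - 0.7) ** 2) / 0.005))  # event B near frame 140

# Satisfaction curve for "A happens, then B": peaks where B follows A.
curve = fuzzy_before(pred_a, pred_b)
top_frames = np.sort(np.argsort(curve)[-16:])  # keep the 16 best frames
```

Because the operators act on continuous scores rather than hard detections, the curve degrades gracefully when an expert is only weakly confident, and the top-k frames can be handed directly to the downstream LVLM.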