HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Abstract

Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
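The composition step described above can be illustrated with a minimal sketch. The abstract does not specify the exact normalization, smoothing kernel, or fuzzy operators, so everything below is an assumption chosen for illustration: min-max normalization, a moving-average smoother, min/max for AND/OR, and a hypothetical `fuzzy_then` operator that scores "A happened earlier, B is happening now". The per-frame expert scores are made-up numbers, not outputs of CLIP or ASR.

```python
def normalize(xs):
    """Min-max normalize raw expert scores to [0, 1] (assumed scheme)."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def smooth(xs, radius=1):
    """Moving average over a window of up to 2*radius + 1 frames (assumed kernel)."""
    out = []
    for i in range(len(xs)):
        window = xs[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def fuzzy_and(a, b):
    """Pointwise conjunction using the minimum t-norm."""
    return [min(x, y) for x, y in zip(a, b)]


def fuzzy_or(a, b):
    """Pointwise disjunction using the maximum t-conorm."""
    return [max(x, y) for x, y in zip(a, b)]


def fuzzy_then(a, b):
    """Hypothetical sequencing operator: b holds now and a held at a strictly earlier frame."""
    out, best_earlier_a = [], 0.0
    for x, y in zip(a, b):
        out.append(min(best_earlier_a, y))
        best_earlier_a = max(best_earlier_a, x)
    return out


def top_k_frames(curve, k):
    """Return the indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(curve)), key=lambda i: curve[i], reverse=True)
    return sorted(ranked[:k])


# Illustrative query: "a dog appears, then barking is heard".
# Both score lists are invented placeholders for per-frame expert outputs.
visual_dog = smooth(normalize([0.1, 0.9, 0.8, 0.2, 0.1, 0.1]))
audio_bark = smooth(normalize([0.0, 0.0, 0.1, 0.9, 0.7, 0.0]))

satisfaction = fuzzy_then(visual_dog, audio_bark)
selected = top_k_frames(satisfaction, k=2)
```

In this sketch the satisfaction curve is only high at frames where the audio predicate fires after the visual predicate has already fired, which is the kind of sub-event ordering that a single dense query embedding cannot express.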
