Stage-Adaptive Token Selection for Efficient Omni-Modal LLMs

Abstract

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences that are interleaved at the window level. However, processing these dense non-textual tokens through every layer of the LLM incurs substantial computational overhead. Training-free token selection can reduce this cost, but existing methods either target visual-only inputs or prune om-LLM tokens only before the LLM using fixed per-modality ratios, and so fail to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many non-textual tokens become redundant in late layers after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget, first across temporal windows and then across modalities, using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, SEATS achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.
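The in-LLM budget allocation described above can be illustrated with a minimal sketch. The abstract does not give the scoring or allocation formulas, so everything here is an assumption made for illustration: the function name `allocate_budget`, the proportional split of the budget by summed relevance, the floor rounding, and the top-k choice within each (window, modality) group. It should not be read as the SEATS implementation.

```python
import numpy as np


def allocate_budget(relevance, window_ids, modality_ids, keep_ratio=0.1):
    """Hypothetical helper: pick non-text tokens to keep at one pruning step.

    relevance    : (N,) non-negative query-relevance score per token
    window_ids   : (N,) temporal-window index per token
    modality_ids : (N,) modality per token (e.g. 0 = visual, 1 = audio)
    Returns the sorted indices of the kept tokens.
    """
    relevance = np.asarray(relevance, dtype=float)
    window_ids = np.asarray(window_ids)
    modality_ids = np.asarray(modality_ids)

    total_budget = max(1, int(round(keep_ratio * relevance.size)))
    total_mass = relevance.sum()
    if total_mass <= 0:
        # Degenerate case (assumption): treat all tokens as equally relevant.
        relevance = np.ones_like(relevance)
        total_mass = relevance.sum()

    keep = []
    for w in np.unique(window_ids):
        in_window = window_ids == w
        window_mass = relevance[in_window].sum()
        if window_mass <= 0:
            continue  # a window with no relevance receives no budget
        # Step 1: share of the total budget given to this temporal window.
        window_budget = total_budget * window_mass / total_mass
        for m in np.unique(modality_ids[in_window]):
            group = np.flatnonzero(in_window & (modality_ids == m))
            # Step 2: split the window's share between its modalities.
            group_budget = window_budget * relevance[group].sum() / window_mass
            k = min(group.size, int(group_budget))  # floor rounding
            # Step 3: keep the k most relevant tokens (stable tie-breaking).
            order = np.argsort(-relevance[group], kind="stable")
            keep.extend(group[order[:k]].tolist())
    return sorted(keep)


# Toy input: two windows, each holding 5 visual tokens followed by 3 audio tokens.
relevance = [3, 2, 2, 1, 0, 2, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0]
window_ids = [0] * 8 + [1] * 8
modality_ids = [0] * 5 + [1] * 3 + [0] * 5 + [1] * 3
kept = allocate_budget(relevance, window_ids, modality_ids, keep_ratio=0.5)
print(kept)
```

Because of the floor rounding, this sketch may keep slightly fewer tokens than the nominal budget. With a purely proportional rule, the two-level split gives each group the same share as a single flat proportional split would. A real method would presumably apply a different rule at each level.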