LatentOmni: Rethinking Omnimodal Understanding via Unified Audio-Visual Latent Reasoning

Abstract

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when a question requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, which weakens temporal grounding and biases intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features, and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks shows that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
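
To make the idea of feature-level supervision concrete, the sketch below shows one plausible form such an objective could take: a mean cosine-distance loss between latent reasoning states and target sensory features. The abstract does not specify the actual loss, so the choice of cosine distance and the names `cosine_similarity`, `feature_alignment_loss`, `latent_states`, and `target_features` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def feature_alignment_loss(latent_states, target_features):
    """Mean cosine distance between paired latent states and target features.

    Hypothetical objective: each latent reasoning state is pulled toward the
    sensory feature it is paired with. 0.0 means perfectly aligned directions,
    2.0 means exactly opposite directions.
    """
    if len(latent_states) != len(target_features):
        raise ValueError("latent_states and target_features must be paired")
    if not latent_states:
        return 0.0
    distances = [
        1.0 - cosine_similarity(h, f)
        for h, f in zip(latent_states, target_features)
    ]
    return sum(distances) / len(distances)
```

In a training setup of this kind, a term like this would typically be added to the usual next-token loss on the textual reasoning steps, so that latent states stay close to the audio or visual features they are meant to represent.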