
FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

January 20, 2026
Authors: Qian Chen, Jinlan Fu, Changsong Li, See-Kiong Ng, Xipeng Qiu
cs.AI

Abstract

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. Evaluated models must perform cross-modal causal and temporal reasoning and effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations of 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF improves future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).
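The abstract reports multiple-choice accuracy (e.g., 64.8% for Gemini 3 Flash) on a dataset released on the Hugging Face Hub. The sketch below is a minimal, hypothetical example of loading FutureOmni and scoring option-letter predictions; it is not the authors' evaluation harness, and the split name and field names ("test", "answer") are assumptions that should be checked against the dataset card.

```python
# Minimal sketch (not the official evaluation code): load FutureOmni from the Hub
# and compute multiple-choice accuracy from predicted option letters.
# Split/field names ("test", "answer") are assumptions; see the dataset card at
# https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni for the actual schema.
from datasets import load_dataset


def choice_accuracy(predictions, references):
    """Fraction of items where the predicted option letter matches the reference."""
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    ds = load_dataset("OpenMOSS-Team/FutureOmni", split="test")  # assumed split name

    # Placeholder predictions: a real run would feed each video/audio clip and
    # question to an omni-modal MLLM and collect the option letter it chooses.
    predictions = ["A"] * len(ds)
    references = [ex["answer"] for ex in ds]  # assumed field name

    print(f"Accuracy: {choice_accuracy(predictions, references):.1%}")
```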