
FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

January 20, 2026
Authors: Qian Chen, Jinlan Fu, Changsong Li, See-Kiong Ng, Xipeng Qiu
cs.AI

Abstract

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. Evaluated models must perform cross-modal causal and temporal reasoning and effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations of 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios; the best accuracy, 64.8%, is achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).
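
Since the benchmark is released as a multiple-choice QA dataset on Hugging Face, a minimal sketch of loading it and scoring option-level accuracy might look like the following. The split name and field names ("question", "options", "answer") are assumptions for illustration and may differ from the actual schema; consult the dataset card before use.

```python
# Hypothetical sketch: load FutureOmni from Hugging Face and compute
# multiple-choice accuracy for a set of model predictions.
from datasets import load_dataset

# Split and field names below are assumed, not confirmed by the paper page.
ds = load_dataset("OpenMOSS-Team/FutureOmni", split="test")

def mcq_accuracy(predictions, dataset):
    """Fraction of questions where the predicted option letter matches the key."""
    correct = sum(
        1
        for pred, example in zip(predictions, dataset)
        if pred.strip().upper() == example["answer"].strip().upper()  # assumed field
    )
    return correct / len(dataset)

# predictions = [run_model(ex["video"], ex["question"], ex["options"]) for ex in ds]
# print(f"FutureOmni accuracy: {mcq_accuracy(predictions, ds):.1%}")
```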