FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Abstract

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).
