Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs
February 7, 2025
Authors: Rohit Saxena, Aryo Pradipta Gema, Pasquale Minervini
cs.AI
Abstract
Understanding time from visual representations is a fundamental cognitive
skill, yet it remains a challenge for multimodal large language models (MLLMs).
In this work, we investigate the capabilities of MLLMs in interpreting time and
date through analogue clocks and yearly calendars. To facilitate this, we
curated a structured dataset comprising two subsets: 1) ClockQA,
which covers various clock styles (standard, black-dial,
no-second-hand, Roman numeral, and arrow-hand clocks) paired with time-related
questions; and 2) CalendarQA, which consists of yearly calendar
images with questions ranging from commonly known dates (e.g., Christmas, New
Year's Day) to computationally derived ones (e.g., the 100th or 153rd day of
the year). We aim to analyse how MLLMs can perform visual recognition,
numerical reasoning, and temporal inference when presented with time-related
visual data. Our evaluations show that despite recent advancements, reliably
understanding time remains a significant challenge for MLLMs.
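
To make the "computationally derived" question type concrete, here is a minimal sketch of how the ground-truth answer to a question such as "What date is the 100th day of the year?" could be computed. The function name `nth_day_of_year` and the example years are assumptions made for illustration; the abstract does not say which years the calendar images show or how the authors generated their answers.

```python
from datetime import date, timedelta


def nth_day_of_year(year: int, n: int) -> date:
    """Return the calendar date of the n-th day (1-indexed) of `year`.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # Number of days in the year (365, or 366 in a leap year).
    days_in_year = (date(year + 1, 1, 1) - date(year, 1, 1)).days
    if not 1 <= n <= days_in_year:
        raise ValueError(f"n must be between 1 and {days_in_year} for {year}")
    # Day 1 is January 1, so offset by n - 1 days.
    return date(year, 1, 1) + timedelta(days=n - 1)


if __name__ == "__main__":
    # Example years are assumed; a leap year shifts the answer by one day.
    for year in (2024, 2025):
        for n in (100, 153):
            print(year, n, nth_day_of_year(year, n).isoformat())
```

Answering such a question from a calendar image requires the same arithmetic (summing month lengths and accounting for leap years) in addition to reading the image, which is why these questions probe numerical reasoning rather than recall of well-known dates.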