RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
August 19, 2025
Authors: Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
cs.AI
Abstract
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270° rotations. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
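To make the task setup concrete, the following is a minimal sketch of how such an evaluation loop might look. It is illustrative only, not the authors' released code: the prompt wording, the rotation convention (counter-clockwise, matching PIL's `rotate`), and the placeholder `query_mllm` function standing in for any MLLM API call are all assumptions.

```python
# Hypothetical harness illustrating the rotation-identification task:
# rotate an image by each candidate angle and ask an MLLM which one it is.
# PROMPT and query_mllm() are stand-ins, not RotBench's actual setup.
from PIL import Image

ANGLES = [0, 90, 180, 270]  # the four rotation classes evaluated

PROMPT = (
    "This image may have been rotated by 0, 90, 180, or 270 degrees "
    "counter-clockwise. Answer with the single rotation angle."
)

def make_rotations(path: str) -> dict[int, Image.Image]:
    """Return the four rotated variants of one benchmark image."""
    img = Image.open(path)
    # PIL's rotate() is counter-clockwise; expand=True keeps the full frame.
    return {a: img.rotate(a, expand=True) for a in ANGLES}

def evaluate(path: str, query_mllm) -> dict[int, bool]:
    """Score one image: did the model recover each ground-truth angle?

    `query_mllm(image, prompt) -> int` is a placeholder for an API call
    that returns the model's predicted rotation angle.
    """
    results = {}
    for angle, rotated in make_rotations(path).items():
        prediction = query_mllm(rotated, PROMPT)
        results[angle] = (prediction == angle)
    return results
```

A voting variant of the kind the abstract mentions could, for example, query the model once per rotated variant and take the majority prediction, though the paper's exact protocol is not specified here.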