RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

August 19, 2025
Authors: Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
cs.AI

Abstract

We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
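
For concreteness, below is a minimal sketch of the two evaluation setups named in the abstract: a single query asking the model for the image's rotation, and a voting variant in which the image is shown re-rotated by each candidate angle and the answers are mapped back to the original frame. The prompt wording, the query_mllm() placeholder, and the vote-mapping details are illustrative assumptions, not RotBench's exact protocol.

```python
# Sketch of rotation-identification evaluation, assuming a generic MLLM API.
from collections import Counter
from PIL import Image

ROTATIONS = [0, 90, 180, 270]
PROMPT = ("This image may have been rotated counterclockwise by 0, 90, 180, "
          "or 270 degrees. By how many degrees was it rotated? "
          "Answer with a single number.")

def query_mllm(image: Image.Image, prompt: str) -> int:
    """Placeholder for a call to an MLLM (e.g., GPT-5 or Gemini-2.5-Pro);
    assumed to return one of 0, 90, 180, 270."""
    raise NotImplementedError

def predict_rotation(image: Image.Image) -> int:
    """Single-query setup: ask the model once about the image as given."""
    return query_mllm(image, PROMPT)

def predict_rotation_with_voting(image: Image.Image) -> int:
    """Voting setup (one possible reading): show the image re-rotated by each
    candidate angle; each answer implies a rotation of the original image."""
    votes = Counter()
    for extra in ROTATIONS:
        view = image.rotate(extra, expand=True)  # PIL rotates counterclockwise
        answer = query_mllm(view, PROMPT)
        # If the view's total rotation is (r + extra) mod 360 and the model
        # answers correctly, the original rotation is (answer - extra) mod 360.
        votes[(answer - extra) % 360] += 1
    return votes.most_common(1)[0][0]
```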