RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
August 19, 2025
Authors: Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
cs.AI
Abstract
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a 350-image manually filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information (including captions, depth maps, and more) or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
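
To make the evaluation setup concrete, here is a minimal sketch of how one might probe a model for the rotation of an image and aggregate answers over re-rotated views by voting. This is an illustration only, not the paper's code or prompts: the `query_mllm` helper, the prompt wording, and the vote aggregation rule are assumptions standing in for whatever MLLM API and exact setup RotBench uses.

```python
# Minimal sketch of a rotation-identification probe in the spirit of RotBench.
# NOTE: query_mllm, the prompt, and the voting rule are illustrative assumptions,
# not the paper's actual implementation.
from collections import Counter
from PIL import Image

ROTATIONS = [0, 90, 180, 270]
PROMPT = ("This image may have been rotated by 0, 90, 180, or 270 degrees "
          "clockwise. Answer with a single number: 0, 90, 180, or 270.")

def query_mllm(image: Image.Image, prompt: str) -> int:
    """Placeholder: send (image, prompt) to an MLLM of your choice and
    parse its reply into one of 0, 90, 180, 270."""
    raise NotImplementedError("Wire this up to a vision-language model API.")

def predict_rotation(image: Image.Image) -> int:
    """Single-view probe: ask the model directly for the rotation."""
    return query_mllm(image, PROMPT)

def predict_rotation_by_voting(image: Image.Image) -> int:
    """Voting-style probe: apply each extra clockwise rotation to the input,
    ask the model what rotation it sees, and vote on the implied original."""
    votes = Counter()
    for extra in ROTATIONS:
        # PIL rotates counter-clockwise, so negate to rotate clockwise.
        view = image.rotate(-extra, expand=True)
        answer = query_mllm(view, PROMPT)  # rotation reported for this view
        # The view carries (original + extra) degrees of clockwise rotation,
        # so the implied original rotation is (answer - extra) mod 360.
        votes[(answer - extra) % 360] += 1
    return votes.most_common(1)[0][0]
```

Under these assumptions, the voting variant only requires that the model be better than chance at recognizing an upright view, which is consistent with the abstract's observation that voting mainly helps weaker models.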