RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
August 19, 2025
Authors: Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
cs.AI
Abstract
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a 350-image manually filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information (including captions, depth maps, and more) or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
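
To make the evaluation setup concrete, here is a minimal sketch of how one might probe a model for the rotation of an image and aggregate answers over re-rotated views by voting. This is an illustration only, not the paper's code or prompts: the `query_mllm` helper, the prompt wording, and the vote aggregation rule are assumptions standing in for whatever MLLM API and exact setup RotBench uses.

```python
# Minimal sketch of a rotation-identification probe in the spirit of RotBench.
# NOTE: query_mllm, the prompt, and the voting rule are illustrative assumptions,
# not the paper's actual implementation.
from collections import Counter
from PIL import Image

ROTATIONS = [0, 90, 180, 270]
PROMPT = ("This image may have been rotated by 0, 90, 180, or 270 degrees "
          "clockwise. Answer with a single number: 0, 90, 180, or 270.")

def query_mllm(image: Image.Image, prompt: str) -> int:
    """Placeholder: send (image, prompt) to an MLLM of your choice and
    parse its reply into one of 0, 90, 180, 270."""
    raise NotImplementedError("Wire this up to a vision-language model API.")

def predict_rotation(image: Image.Image) -> int:
    """Single-view probe: ask the model directly for the rotation."""
    return query_mllm(image, PROMPT)

def predict_rotation_by_voting(image: Image.Image) -> int:
    """Voting-style probe: apply each extra clockwise rotation to the input,
    ask the model what rotation it sees, and vote on the implied original."""
    votes = Counter()
    for extra in ROTATIONS:
        # PIL rotates counter-clockwise, so negate to rotate clockwise.
        view = image.rotate(-extra, expand=True)
        answer = query_mllm(view, PROMPT)  # rotation reported for this view
        # The view carries (original + extra) degrees of clockwise rotation,
        # so the implied original rotation is (answer - extra) mod 360.
        votes[(answer - extra) % 360] += 1
    return votes.most_common(1)[0][0]
```

Under these assumptions, the voting variant only requires that the model be better than chance at recognizing an upright view, which is consistent with the abstract's observation that voting mainly helps weaker models.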