RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
August 19, 2025
Authors: Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
cs.AI
Abstract
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270° rotations. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
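To make the task setup concrete, the following is a minimal sketch of how such an evaluation loop might look. It is illustrative only, not the authors' released code: the prompt wording, the rotation convention (counter-clockwise, matching PIL's `rotate`), and the placeholder `query_mllm` function standing in for any MLLM API call are all assumptions.

```python
# Hypothetical harness illustrating the rotation-identification task:
# rotate an image by each candidate angle and ask an MLLM which one it is.
# PROMPT and query_mllm() are stand-ins, not RotBench's actual setup.
from PIL import Image

ANGLES = [0, 90, 180, 270]  # the four rotation classes evaluated

PROMPT = (
    "This image may have been rotated by 0, 90, 180, or 270 degrees "
    "counter-clockwise. Answer with the single rotation angle."
)

def make_rotations(path: str) -> dict[int, Image.Image]:
    """Return the four rotated variants of one benchmark image."""
    img = Image.open(path)
    # PIL's rotate() is counter-clockwise; expand=True keeps the full frame.
    return {a: img.rotate(a, expand=True) for a in ANGLES}

def evaluate(path: str, query_mllm) -> dict[int, bool]:
    """Score one image: did the model recover each ground-truth angle?

    `query_mllm(image, prompt) -> int` is a placeholder for an API call
    that returns the model's predicted rotation angle.
    """
    results = {}
    for angle, rotated in make_rotations(path).items():
        prediction = query_mllm(rotated, PROMPT)
        results[angle] = (prediction == angle)
    return results
```

A voting variant of the kind the abstract mentions could, for example, query the model once per rotated variant and take the majority prediction, though the paper's exact protocol is not specified here.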