RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

Abstract

We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated by 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a manually filtered benchmark of 350 lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information (captions, depth maps, and more) or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models reliably identify right-side-up (0°) images, while certain models can also identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
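The abstract does not describe how the voting setup is implemented, so the following is only a minimal sketch of one plausible reading: the input is shown to the model in all four orientations, each prediction is mapped back to the frame of the original input, and the majority answer is kept. Everything here is an assumption made for illustration: the names `rotate_ccw`, `vote_rotation`, and the `predict` callable are hypothetical, images are stood in for by small 2D grids, and rotation is taken to be counter-clockwise.

```python
from collections import Counter
from typing import Callable, List

ANGLES = (0, 90, 180, 270)
Grid = List[List[int]]  # stand-in for an image (assumption for this sketch)


def rotate_ccw(grid: Grid, angle: int) -> Grid:
    """Rotate a 2D grid counter-clockwise by a multiple of 90 degrees."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90")
    out = [list(row) for row in grid]
    for _ in range((angle // 90) % 4):
        # One counter-clockwise quarter turn: transpose, then reverse row order.
        out = [list(row) for row in zip(*out)][::-1]
    return out


def vote_rotation(image: Grid, predict: Callable[[Grid], int]) -> int:
    """Estimate the rotation of `image` by majority vote over four views.

    `predict` is a hypothetical model wrapper that returns one of ANGLES
    for the view it is given.
    """
    votes: Counter = Counter()
    for extra in ANGLES:
        view = rotate_ccw(image, extra)
        guess = predict(view)
        # Undo the extra rotation so every vote refers to the original input.
        votes[(guess - extra) % 360] += 1
    # Ties are broken by whichever candidate was counted first.
    return votes.most_common(1)[0][0]
```

Under this reading, a predictor that makes a systematic error on one orientation (for example, labelling every 90° view as 270°) is outvoted by the three views on which it answers correctly, which is consistent with the reported gains for weaker models. A predictor whose errors are correlated across views would not benefit in the same way.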
