RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

Abstract

We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated by 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a manually filtered benchmark of 350 images comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information (including captions, depth maps, and more) or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models can reliably identify right-side-up (0°) images, while certain models can identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
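The four-way rotation task and its per-angle scoring can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes a tiny 2D grid as a stand-in for an image, a counter-clockwise rotation convention, and a hypothetical `toy_predict` function in place of a real MLLM call. The toy predictor is deliberately built to recognize 0° and 180° while confusing 90° with 270°, only to show how that failure pattern appears in per-angle accuracy.

```python
from collections import Counter

ANGLES = (0, 90, 180, 270)


def rotate_grid(grid, angle):
    """Rotate a 2D list counter-clockwise by a multiple of 90 degrees.

    The counter-clockwise convention is an assumption of this sketch.
    """
    if angle not in ANGLES:
        raise ValueError(f"angle must be one of {ANGLES}, got {angle}")
    rotated = [list(row) for row in grid]
    for _ in range(angle // 90):
        # One counter-clockwise quarter turn: transpose, then reverse the row order.
        rotated = [list(row) for row in zip(*rotated)][::-1]
    return rotated


def per_angle_accuracy(records):
    """Compute accuracy per true angle from (true_angle, predicted_angle) pairs."""
    totals, correct = Counter(), Counter()
    for true_angle, predicted in records:
        totals[true_angle] += 1
        if predicted == true_angle:
            correct[true_angle] += 1
    return {a: correct[a] / totals[a] for a in ANGLES if totals[a]}


def evaluate(predict, images):
    """Show every image at all four rotations and score the predictions."""
    records = []
    for image in images:
        for angle in ANGLES:
            records.append((angle, predict(rotate_grid(image, angle))))
    return per_angle_accuracy(records)


def toy_predict(grid):
    """Hypothetical stand-in for a model call (illustration only).

    It spots upright and upside-down inputs from where the "sky" row sits,
    and otherwise always answers 90, so it cannot tell 90 from 270.
    """
    if all(cell == "sky" for cell in grid[0]):
        return 0
    if all(cell == "sky" for cell in grid[-1]):
        return 180
    return 90


if __name__ == "__main__":
    image = [["sky", "sky"], ["ground", "ground"]]
    print(evaluate(toy_predict, [image]))
```

With the single toy image, the sketch reports full accuracy for 0°, 90°, and 180° and zero accuracy for 270°. Reporting accuracy per angle, rather than one pooled number, is what makes a 90°/270° confusion visible.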

