CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

Abstract

Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly assess either text-based reasoning or vision-language understanding, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers different evaluation strategies ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs substantially outperform non-reasoning models by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
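
As a rough illustration of the crossing-letter (intersectional) constraints mentioned above, the following minimal sketch checks whether a set of proposed answers agree on every shared grid cell. The entry format (row, column, direction, answer) and the function name crossing_conflicts are assumptions made for this example; they are not taken from the benchmark's actual data format or code.

```python
from typing import Dict, List, Tuple

# Hypothetical entry format: (row, col, direction, answer).
# This is an illustrative assumption, not the benchmark's real schema.
Entry = Tuple[int, int, str, str]


def crossing_conflicts(entries: List[Entry]) -> List[Tuple[int, int]]:
    """Return the grid cells where two answers disagree on the letter."""
    cells: Dict[Tuple[int, int], str] = {}
    conflicts: List[Tuple[int, int]] = []
    for row, col, direction, answer in entries:
        for i, letter in enumerate(answer.upper()):
            # Across answers advance along columns, down answers along rows.
            cell = (row, col + i) if direction == "across" else (row + i, col)
            # setdefault stores the first letter seen for a cell and returns it.
            if cells.setdefault(cell, letter) != letter:
                conflicts.append(cell)
    return conflicts


consistent = [(0, 0, "across", "CAT"), (0, 1, "down", "APE")]
inconsistent = [(0, 0, "across", "CAT"), (0, 1, "down", "OWL")]
print(crossing_conflicts(consistent))
print(crossing_conflicts(inconsistent))
```

In this toy grid, "CAT" and "APE" share the letter A at cell (0, 1), so they are compatible, whereas "OWL" would place an O in that same cell and is reported as a conflict. A solver that uses such checks can reject a clue answer that is semantically plausible but violates an intersection.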
