CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

Abstract

We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate the benchmark's utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model raises tree-generation accuracy on CurveBench-Easy from 2.8% (Qwen3-VL-8B-Thinking) to 33.3%, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.
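
To make the structured-prediction target concrete, the sketch below shows one way a rooted containment tree could be represented and compared. It is illustrative only: the abstract does not specify the annotation format or the scoring rule, so the parent-array encoding, the function names (canonical_form, trees_match, exact_match_accuracy), and the assumption that a prediction counts as correct only when it is isomorphic to the annotation as an unordered rooted tree are all hypothetical.

```python
from collections import defaultdict
from typing import Sequence


def canonical_form(parent: Sequence[int]) -> str:
    """Return an order-independent encoding of a rooted tree.

    parent[i] is the index of the region that directly contains region i,
    and -1 marks the root (the unbounded outer region). This parent-array
    format is a hypothetical choice, not the benchmark's own format.
    """
    children = defaultdict(list)
    roots = []
    for node, par in enumerate(parent):
        if par == -1:
            roots.append(node)
        else:
            children[par].append(node)
    if len(roots) != 1:
        raise ValueError("expected exactly one root")

    def encode(node: int) -> str:
        # Sorting the child encodings makes sibling order irrelevant.
        return "(" + "".join(sorted(encode(c) for c in children[node])) + ")"

    code = encode(roots[0])
    # One "(" is emitted per reachable node; a mismatch means a cycle
    # or an invalid parent index left some node unreachable.
    if code.count("(") != len(parent):
        raise ValueError("parent array does not describe a single tree")
    return code


def trees_match(pred: Sequence[int], gold: Sequence[int]) -> bool:
    """True if both trees are isomorphic as unordered rooted trees."""
    return canonical_form(pred) == canonical_form(gold)


def exact_match_accuracy(preds, golds) -> float:
    """Fraction of predictions matching their annotation exactly
    (assumed scoring rule, see the lead-in)."""
    hits = sum(trees_match(p, g) for p, g in zip(preds, golds))
    return hits / len(golds)


# Outer region 0 contains curves 1 and 2; curve 1 contains curve 3.
gold = [-1, 0, 0, 1]
same_shape = [-1, 0, 1, 0]   # same nesting, different labels
chain = [-1, 0, 1, 2]        # four regions nested one inside the next
print(trees_match(same_shape, gold))
print(trees_match(chain, gold))
```

Under this assumed rule, a single misplaced curve changes the canonical encoding and the whole prediction is scored as wrong, which is one plausible reading of why exact tree recovery is much harder than the visual simplicity of the images suggests.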