CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

Abstract

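We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 798 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate the benchmark's utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model raises tree-generation accuracy on CurveBench-Easy from the 2.8% of Qwen3-VL-8B-Thinking to 33.3%, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.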

English

We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate the benchmark's utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model raises tree-generation accuracy on CurveBench-Easy from the 2.8% of Qwen3-VL-8B-Thinking to 33.3%, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.
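
To make the structured-prediction target concrete, the following is a minimal sketch of how a predicted containment tree might be compared against its annotation. It is an illustration, not the benchmark's evaluation code. It assumes that trees are given as parent arrays (with -1 marking the root, taken here to be the unbounded outer region), that a prediction counts as correct only if it matches the reference as an unordered rooted tree, and that inputs are well-formed trees. The function names `canonical_form`, `trees_match`, and `exact_match_accuracy` are hypothetical.

```python
from collections import defaultdict


def canonical_form(parents):
    """Return a canonical string for an unordered rooted tree.

    parents[i] is the parent of node i; the root has parent -1.
    This parent-array encoding is an assumption made for illustration.
    """
    children = defaultdict(list)
    roots = []
    for node, parent in enumerate(parents):
        if parent == -1:
            roots.append(node)
        else:
            children[parent].append(node)
    if len(roots) != 1:
        raise ValueError("expected exactly one root")

    def encode(node):
        # Sorting the child encodings makes the result independent of
        # node labels and of the order in which siblings are listed.
        return "(" + "".join(sorted(encode(c) for c in children[node])) + ")"

    return encode(roots[0])


def trees_match(predicted, reference):
    """True if the two trees are isomorphic as unordered rooted trees."""
    return canonical_form(predicted) == canonical_form(reference)


def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose tree matches the reference exactly."""
    hits = sum(trees_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)


# Outer region 0 contains two curves; one of them contains a third curve.
reference = [-1, 0, 0, 2]
relabelled = [-1, 0, 1, 0]   # same shape, different node labels
chain = [-1, 0, 1, 2]        # three curves nested one inside another

same_shape = trees_match(relabelled, reference)
different_shape = trees_match(chain, reference)
```

Under these assumptions a prediction receives no partial credit: a single misplaced curve changes the canonical string, which is one way an exact tree-level metric can be much stricter than per-curve scoring.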