CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

Abstract

We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves spanning easy, polygonal, topography-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree that encodes the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Although the task looks visually simple, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate the benchmark's utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model improves over Qwen3-VL-8B-Thinking from 2.8% to 33.3% tree-generation accuracy on CurveBench-Easy, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.
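
To make the target structure concrete, the following is a minimal sketch of how a rooted containment tree can be derived from a set of disjoint closed curves and compared against a prediction. It rests on several assumptions that the abstract does not state: curves are given here as vertex lists of simple polygons (the benchmark itself supplies images), the root stands for the unbounded outer region, and two trees count as matching when they are isomorphic as unordered rooted trees. All function names (`containment_parents`, `canonical_form`, and the helpers) are hypothetical and are not taken from the CurveBench code.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Polygon = Sequence[Point]


def polygon_area(poly: Polygon) -> float:
    """Absolute area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Ray-casting test; the point is assumed not to lie on the boundary."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def containment_parents(curves: List[Polygon]) -> List[Optional[int]]:
    """Parent index of each curve; None means the unbounded outer region.

    Because the curves are pairwise disjoint, one vertex of a curve decides
    whether the whole curve lies inside another. The parent is the smallest
    enclosing curve (assumption: smallest by area).
    """
    areas = [polygon_area(c) for c in curves]
    parents: List[Optional[int]] = []
    for i, curve in enumerate(curves):
        enclosing = [
            j for j, other in enumerate(curves)
            if j != i and point_in_polygon(curve[0], other)
        ]
        parents.append(min(enclosing, key=lambda j: areas[j]) if enclosing else None)
    return parents


def canonical_form(parents: List[Optional[int]]) -> str:
    """Parenthesis encoding that is identical for isomorphic unordered trees."""
    children: Dict[Optional[int], List[int]] = {}
    for child, parent in enumerate(parents):
        children.setdefault(parent, []).append(child)

    def encode(node: Optional[int]) -> str:
        parts = sorted(encode(c) for c in children.get(node, []))
        return "(" + "".join(parts) + ")"

    return encode(None)  # None is the root (outer region)


def square(lo: float, hi: float) -> Polygon:
    """Axis-aligned square with corners (lo, lo) and (hi, hi)."""
    return [(lo, lo), (hi, lo), (hi, hi), (lo, hi)]


# One outer curve holding two siblings, one of which holds a further curve.
curves = [square(0, 10), square(1, 4), square(5, 9), square(6, 8)]
parents = containment_parents(curves)
```

Under these assumptions, a prediction would be scored as correct when `canonical_form` of the predicted parent list equals that of the annotation, which makes the comparison independent of how the curves happen to be indexed. Whether CurveBench's tree-generation accuracy uses exactly this notion of equality is not specified in the abstract.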