

ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

March 26, 2026
Authors: Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang, Jie Hu, Xiaojiang Peng, Lin Ma, Xiaoming Wei, Xiu Li
cs.AI

Abstract

Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at https://vincenthancoder.github.io/ViGoR-Bench/