ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

Abstract

Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at https://vincenthancoder.github.io/ViGoR-Bench/
