
ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

March 26, 2026
Authors: Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang, Jie Hu, Xiaojiang Peng, Lin Ma, Xiaoming Wei, Xiu Li
cs.AI

Abstract

Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism that evaluates both intermediate processes and final results; 3) an evidence-grounded automated judge that ensures high alignment with human judgment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at https://vincenthancoder.github.io/ViGoR-Bench/