

VQ-VA World: Towards High-Quality Visual Question-Visual Answering

November 25, 2025
Authors: Chenhui Gou, Zilong Chen, Zeyu Wang, Feng Li, Deyao Zhu, Zicheng Duan, Kunchang Li, Chaorui Deng, Hongyi Yuan, Haoqi Fan, Cihang Xie, Jianfei Cai, Hamid Rezatofighi
cs.AI

Abstract

This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To bring this capability to open-source models as well, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Deployed at web scale, this pipeline crawls approximately 1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along three axes: world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (7.78 for vanilla LightFusion; 1.94 for UniWorld-V1) and significantly narrowing the gap to leading proprietary systems (81.67 for NanoBanana; 82.64 for GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.
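
As a rough, hypothetical sketch (not taken from the paper or its released code), the snippet below illustrates one way a crawled VQ-VA training sample (input image plus visual question, paired with a target answer image) and a simple benchmark-style score aggregation could be represented in Python; all class, field, and function names are illustrative assumptions, not the actual dataset schema or IntelligentBench protocol.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class VQVASample:
        """Hypothetical schema for one interleaved VQ-VA training sample.

        Field names are illustrative only, not the released dataset format.
        """
        question_image: str   # path or URL of the input (query) image
        question_text: str    # the visual question posed about that image
        answer_image: str     # path or URL of the target answer image
        source_url: str = ""  # provenance recorded by the web crawl
        tags: List[str] = field(default_factory=list)  # e.g. "world-knowledge", "design", "reasoning"

    def benchmark_style_score(per_sample_scores: List[float]) -> float:
        """Toy aggregation: mean of per-sample scores in [0, 1], scaled to 0-100.

        The real IntelligentBench evaluation is human-curated and may differ.
        """
        if not per_sample_scores:
            return 0.0
        return 100.0 * sum(per_sample_scores) / len(per_sample_scores)

    if __name__ == "__main__":
        sample = VQVASample(
            question_image="imgs/query.png",
            question_text="What would this room look like redecorated in Art Deco style?",
            answer_image="imgs/answer.png",
            tags=["design"],
        )
        print(sample.question_text)
        print(benchmark_style_score([0.5, 0.6, 0.4]))  # -> 50.0

Under these assumptions, a pipeline would emit one such record per crawled question-answer image pair, and a benchmark run would map each model output to a per-sample score before aggregating to a single 0-100 number comparable to the figures quoted above.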