
Constantly Improving Image Models Need Constantly Improving Benchmarks

October 16, 2025
作者: Jiaxin Ge, Grace Luo, Heekyung Lee, Nishant Malpani, Long Lian, XuDong Wang, Aleksander Holynski, Trevor Darrell, Sewon Min, David M. Chan
cs.AI

Abstract

Recent advances in image generation, often driven by proprietary systems like GPT-4o Image Gen, regularly introduce new capabilities that reshape how users interact with these models. Existing benchmarks often lag behind and fail to capture these emerging use cases, leaving a gap between community perceptions of progress and formal evaluation. To address this, we present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use: social media posts that showcase novel prompts and qualitative user judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from such posts. Our analysis shows that ECHO (1) discovers creative and complex tasks absent from existing benchmarks, such as re-rendering product labels across languages or generating receipts with specified totals, (2) more clearly distinguishes state-of-the-art models from alternatives, and (3) surfaces community feedback that we use to inform the design of metrics for model quality (e.g., measuring observed shifts in color, identity, and structure). Our website is at https://echo-bench.github.io.