Constantly Improving Image Models Need Constantly Improving Benchmarks
October 16, 2025
Authors: Jiaxin Ge, Grace Luo, Heekyung Lee, Nishant Malpani, Long Lian, XuDong Wang, Aleksander Holynski, Trevor Darrell, Sewon Min, David M. Chan
cs.AI
Abstract
Recent advances in image generation, often driven by proprietary systems like
GPT-4o Image Gen, regularly introduce new capabilities that reshape how users
interact with these models. Existing benchmarks often lag behind and fail to
capture these emerging use cases, leaving a gap between community perceptions
of progress and formal evaluation. To address this, we present ECHO, a
framework for constructing benchmarks directly from real-world evidence of
model use: social media posts that showcase novel prompts and qualitative user
judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset
of over 31,000 prompts curated from such posts. Our analysis shows that ECHO
(1) discovers creative and complex tasks absent from existing benchmarks, such
as re-rendering product labels across languages or generating receipts with
specified totals, (2) more clearly distinguishes state-of-the-art models from
alternatives, and (3) surfaces community feedback that we use to inform the
design of metrics for model quality (e.g., measuring observed shifts in color,
identity, and structure). Our website is at https://echo-bench.github.io.