What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models
January 7, 2026
Authors: Dasol Choi, Guijin Son, Hanwool Lee, Minhyuk Kim, Hyunwoo Ko, Teabin Lim, Ahn Eungyeol, Jungwhan Kim, Seunghyeok Hong, Youngsook Song
cs.AI
Abstract
Current vision-language benchmarks predominantly feature well-structured questions with clear, explicit prompts. However, real user queries are often informal and under-specified: users naturally leave much unsaid, relying on the image to convey context. We introduce HAERAE-Vision, a benchmark of 653 real-world visual questions from Korean online communities (a 0.76% survival rate from 86K candidates), each paired with an explicit rewrite, yielding 1,306 query variants in total. Evaluating 39 VLMs, we find that even state-of-the-art models (GPT-5, Gemini 2.5 Pro) achieve under 50% accuracy on the original queries. Crucially, query explicitation alone yields 8- to 22-point improvements, with smaller models benefiting most. We further show that even with web search, under-specified queries underperform explicit queries without search, revealing that current retrieval cannot compensate for what users leave unsaid. Our findings demonstrate that a substantial portion of VLM difficulty stems from natural query under-specification rather than from limited model capability, highlighting a critical gap between benchmark evaluation and real-world deployment.
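To make the paired-variant protocol concrete, below is a minimal sketch of how the original-vs-explicit comparison could be scored. It assumes a hypothetical JSONL release in which each record carries an image reference, the original under-specified query, its explicit rewrite, and a gold answer; the `ask_vlm` and `is_correct` stubs stand in for whatever model client and answer-judging step an actual evaluation would use, and the field names are assumptions rather than the paper's real schema.

```python
import json


def ask_vlm(image_path: str, question: str) -> str:
    """Placeholder for a real VLM call (API client or local model); not part of the paper's code."""
    raise NotImplementedError


def is_correct(prediction: str, gold: str) -> bool:
    """Naive exact-match judge; a real evaluation would likely use a stricter rubric or LLM judge."""
    return prediction.strip().lower() == gold.strip().lower()


def evaluate(pairs_path: str) -> None:
    # Assumed record layout (hypothetical):
    # {"image": ..., "original_query": ..., "explicit_query": ..., "answer": ...}
    n = hits_orig = hits_explicit = 0
    with open(pairs_path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            n += 1
            # Score the same image twice: once with the raw user query,
            # once with its explicit rewrite.
            if is_correct(ask_vlm(ex["image"], ex["original_query"]), ex["answer"]):
                hits_orig += 1
            if is_correct(ask_vlm(ex["image"], ex["explicit_query"]), ex["answer"]):
                hits_explicit += 1
    acc_orig = hits_orig / n
    acc_explicit = hits_explicit / n
    # The gap below corresponds to the "explicitation gain" the abstract reports (8-22 points).
    print(f"original: {acc_orig:.1%}  explicit: {acc_explicit:.1%}  gain: {acc_explicit - acc_orig:+.1%}")
```

Running the same loop per model makes the abstract's comparison reproducible in spirit: the only quantity of interest is the per-model accuracy gap between the under-specified and explicit variants of the same 653 questions.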