DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain
April 12, 2026
Authors: Song Jin, Juntian Zhang, Xun Zhang, Zeying Tian, Fei Jiang, Guojun Yin, Wei Lin, Yong Liu, Rui Yan
cs.AI
Abstract
Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: fine-grained classification, nutrition estimation, and visual question answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained "hard" negatives drawn from the same menus and rigorously verified nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, identifying five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All code is released at https://github.com/meituan/DiningBench.