DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain
April 12, 2026
Authors: Song Jin, Juntian Zhang, Xun Zhang, Zeying Tian, Fei Jiang, Guojun Yin, Wei Lin, Yong Liu, Rui Yan
cs.AI
Abstract
Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: fine-grained classification, nutrition estimation, and visual question answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained "hard" negatives drawn from the same menus and rigorously verified nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, identifying five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All code is released at https://github.com/meituan/DiningBench.