DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain

Abstract

Recent advances in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, and it incorporates fine-grained "hard" negatives drawn from the same menus together with rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, and we identify five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All code is released at https://github.com/meituan/DiningBench.
