DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain

Abstract

Recent advances in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained "hard" negatives drawn from the same menu and rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, and identify five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All code is released at https://github.com/meituan/DiningBench.
