MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Abstract

The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, owing to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. Our investigation finds that current benchmarks incorporate excessive visual content within their textual questions, which potentially helps MLLMs deduce answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering a different degree of information content across modalities, yielding 15K test samples in total. This approach allows MathVerse to comprehensively assess whether, and to what extent, MLLMs truly understand the visual diagrams when reasoning mathematically. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging an answer True or False, we employ GPT-4(V) to adaptively extract the crucial reasoning steps and then score each step with a detailed error analysis, which can reveal the quality of the intermediate CoT reasoning of MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io
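The sample count and the step-level scoring idea can be illustrated with a minimal sketch. The figures 2,612 and six come from the abstract; everything else is an assumption made for illustration. In particular, the `ScoredStep` type and the `cot_score` function are hypothetical, and averaging per-step verdicts is only one plausible aggregation, since the abstract does not specify how step scores are combined. The verdicts are hard-coded here rather than produced by GPT-4(V).

```python
from dataclasses import dataclass

NUM_PROBLEMS = 2612        # stated in the abstract
VERSIONS_PER_PROBLEM = 6   # stated in the abstract

# 2,612 problems x 6 versions = 15,672 samples, reported as 15K.
total_samples = NUM_PROBLEMS * VERSIONS_PER_PROBLEM


@dataclass
class ScoredStep:
    """Hypothetical record for one reasoning step extracted by a judge model."""
    text: str      # the extracted reasoning step
    correct: bool  # the judge's verdict for this step


def cot_score(steps: list[ScoredStep]) -> float:
    """Assumed aggregation: the fraction of extracted steps judged correct."""
    if not steps:
        return 0.0
    return sum(step.correct for step in steps) / len(steps)


# Illustrative, hand-written verdicts (not real benchmark output).
example = [
    ScoredStep("Read the radius r = 3 from the diagram", True),
    ScoredStep("Apply area = pi * r^2", True),
    ScoredStep("Evaluate 9 * pi as 27.3", False),
]

# A binary True/False judge would mark this answer as simply wrong;
# the step-level score still credits the two correct steps.
partial_credit = cot_score(example)
```

Under this assumed aggregation, an answer with a correct reading of the diagram but a final arithmetic slip is distinguished from one that misreads the diagram from the start, which is the kind of distinction a fine-grained evaluation is meant to surface.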

