
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

October 16, 2024
Authors: Fengji Zhang, Linquan Wu, Huiyu Bai, Guancheng Lin, Xiao Li, Xiao Yu, Yue Wang, Bei Chen, Jacky Keung
cs.AI

Abstract

Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the implementation of functional programs -- core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with visual perception and understanding capabilities, there remains a notable lack of coding benchmarks that rigorously assess these models, particularly in tasks that emphasize visual reasoning. To address this gap, we introduce HumanEval-V, a novel and lightweight benchmark specifically designed to evaluate LMMs' visual understanding and reasoning capabilities through code generation. HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow. Each task is adapted by modifying the context and algorithmic patterns of the original problems, with visual elements redrawn to ensure distinction from the source, preventing potential data leakage. LMMs are required to complete the code solution based on the provided visual context and a predefined Python function signature outlining the task requirements. Every task is equipped with meticulously handcrafted test cases to ensure a thorough and reliable evaluation of model-generated solutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1 and 36.4% pass@10, while open-weight models with 70B parameters score below 4% pass@1. Ablation studies further reveal the limitations of current LMMs in vision reasoning and coding capabilities. These results underscore key areas for future research to enhance LMMs' capabilities. We have open-sourced our code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.
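
To make the task format concrete, here is a hypothetical, simplified sketch of what a HumanEval-V-style item might look like: a Python function signature whose requirements depend on an accompanying image, together with handcrafted test cases. The function name, docstring, and tests below are illustrative assumptions, not items drawn from the benchmark.

```python
# Hypothetical sketch of a HumanEval-V-style task (illustrative only; not an
# actual benchmark item). The model receives an image plus the function
# signature and must fill in the body so that the handcrafted tests pass.

from typing import List

def count_filled_cells(grid: List[List[int]]) -> int:
    """Return the number of filled cells in the grid shown in the image.

    The accompanying visual context (an image of a partially filled grid)
    defines which cells count as "filled"; here we assume non-zero entries
    are filled.
    """
    # Reference solution a model would be expected to produce.
    return sum(1 for row in grid for cell in row if cell != 0)

# Handcrafted test cases, in the spirit of the benchmark's evaluation.
assert count_filled_cells([[0, 1], [1, 1]]) == 3
assert count_filled_cells([[0, 0], [0, 0]]) == 0
```

The abstract reports pass@1 and pass@10 but does not spell out the estimator; coding benchmarks in this family typically use the unbiased pass@k estimator of Chen et al. (2021), so treat the following as a minimal sketch under that assumption. Here n is the number of sampled solutions per task and c the number that pass all test cases; per-task values are averaged over the benchmark.

```python
# Minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021).
# Whether HumanEval-V uses exactly this formulation is an assumption.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per task, 3 of them correct.
print(round(pass_at_k(20, 3, 1), 3))   # pass@1  = 0.15
print(round(pass_at_k(20, 3, 10), 3))  # pass@10 ≈ 0.895
```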
