Abstract

Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework that can be applied to any LMM and dataset to assess this ability autonomously. Building on this framework, we introduce InterFeedback-Bench, which evaluates interactive intelligence on two representative datasets, MMMU-Pro and MathVerse, across 10 open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results show that even state-of-the-art LMMs such as OpenAI-o1 correct their results through human feedback in fewer than 50% of cases. These findings point to the need for methods that enhance LMMs' ability to interpret and benefit from feedback.

InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
