ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

Abstract

Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can multimodal large language models (MLLMs) exhibit similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench and show that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; and (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we find that conversation histories and in-context learning introduce negative biases that hinder performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning; its results suggest that proactiveness can be learned and can even generalize to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.
