MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions

Abstract

Multimodal Large Language Models (MLLMs) have recently made considerable advances in vision-language tasks, yet they can still produce potentially harmful or untrustworthy content. Despite substantial work on the trustworthiness of language models, the ability of MLLMs to act honestly, especially when faced with visually unanswerable questions, remains largely unexplored. This work presents the first systematic assessment of honesty behaviors across a range of MLLMs. We ground honesty in models' response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MLLM honesty benchmark consisting of 12k+ visual question samples, whose quality is ensured by multi-stage filtering and human verification. Using MoHoBench, we benchmark the honesty of 28 popular MLLMs and conduct a comprehensive analysis. Our findings show that: (1) most models fail to refuse to answer when refusal is warranted, and (2) the honesty of MLLMs is not solely a language modeling issue but is deeply influenced by visual information, which calls for dedicated methods for multimodal honesty alignment. We therefore implement initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs. Our data and code can be found at https://github.com/DSTTSD/MoHoBench.
