MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

Abstract

Safety evaluation and red-teaming of large language models (LLMs) remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, and Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy into a single browser-based system. A dual-metric framework distinguishes hard Attack Success Rate (ASR), which counts only Compliance, from soft ASR, which also counts Partial Compliance, and thereby captures partial information leakage that binary metrics miss. To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation. Experiments across six multimodal LLMs from four providers show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal. ITMS does not uniformly raise final ASR on already-saturated baselines, but it accelerates convergence by destabilizing early-turn defenses. Ablation further reveals that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.
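The dual-metric idea and the per-turn modality rotation can be illustrated with a minimal Python sketch. It is not taken from the MUSE codebase. The abstract names only the Compliance and Partial Compliance levels, so the label strings, the function names, and the simple round-robin rotation order are all assumptions made for illustration.

```python
from collections import Counter
from itertools import cycle, islice

# Only "compliance" and "partial_compliance" correspond to levels named in
# the abstract; the exact label strings are hypothetical, and any other
# label (e.g. "refusal") stands in for the remaining taxonomy levels.
HARD_LABELS = {"compliance"}
SOFT_LABELS = {"compliance", "partial_compliance"}


def attack_success_rates(verdicts):
    """Return (hard_asr, soft_asr) for a list of per-run judge labels."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    total = len(verdicts)
    # Hard ASR counts only full Compliance.
    hard = sum(counts[label] for label in HARD_LABELS) / total
    # Soft ASR additionally counts Partial Compliance.
    soft = sum(counts[label] for label in SOFT_LABELS) / total
    return hard, soft


def modality_schedule(modalities, num_turns):
    """Assign one modality per turn by cycling through `modalities` in order.

    Round-robin order is an assumption; the abstract says only that the
    modality is rotated per turn.
    """
    return list(islice(cycle(modalities), num_turns))


if __name__ == "__main__":
    verdicts = ["compliance", "partial_compliance", "refusal", "refusal"]
    print(attack_success_rates(verdicts))
    print(modality_schedule(["text", "image", "audio"], 5))
```

With one Compliance and one Partial Compliance verdict out of four runs, hard ASR is 0.25 while soft ASR is 0.5. That gap is the partial leakage a single binary metric would hide.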
