MARVIS: Modality Adaptive Reasoning over Visualizations

Abstract

Scientific applications of machine learning often rely on small, specialized models tuned to particular domains. Such models can achieve excellent performance, but they lack flexibility. Foundation models offer versatility, but they typically underperform specialized approaches, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables even small vision-language models (VLMs) to make predictions on any data modality with high accuracy. MARVIS transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to interpret and use them. Using a single 3B-parameter model, MARVIS achieves competitive performance on vision, audio, biological, and tabular domains, beating Gemini by 16% on average and approaching specialized methods, without exposing personally identifiable information (PII) or requiring any domain-specific training. We open source our code and datasets at https://github.com/penfever/marvis.
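
To make the central step concrete, the following is a minimal sketch of turning a latent embedding space into a visual representation. It is not the MARVIS implementation: the abstract does not say which projection or renderer is used, so PCA (via SVD) and a simple raster scatter plot are stand-ins chosen for brevity. The function names `project_2d` and `render_scatter`, the color palette, and the synthetic data are all hypothetical.

```python
import numpy as np

def project_2d(embeddings):
    """Project embeddings to 2D with PCA via SVD (an assumed stand-in projection)."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def render_scatter(coords, labels, query_index, size=64):
    """Rasterize 2D points into an RGB array: one color per class, query in red."""
    palette = np.array([[31, 119, 180], [44, 160, 44], [148, 103, 189]], dtype=np.uint8)
    image = np.full((size, size, 3), 255, dtype=np.uint8)  # white background
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against a zero-width axis
    pixels = ((coords - lo) / span * (size - 1)).round().astype(int)
    for i, (x, y) in enumerate(pixels):
        if i != query_index:
            image[y, x] = palette[labels[i] % len(palette)]
    qx, qy = pixels[query_index]
    image[qy, qx] = (255, 0, 0)  # drawn last so the query point is never hidden
    return image

# Hypothetical data: two labeled clusters of 16-d embeddings plus one unlabeled query.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0, 1, (20, 16)),
    rng.normal(4, 1, (20, 16)),
    rng.normal(4, 1, (1, 16)),   # the query sample
])
labels = np.array([0] * 20 + [1] * 20 + [-1])  # -1 marks the unlabeled query
coords = project_2d(embeddings)
image = render_scatter(coords, labels, query_index=40)
```

In a pipeline of this kind, the rendered image would then be passed to a VLM together with a text prompt, for example one asking which colored group the red point most plausibly belongs to. That call is omitted here because it depends on the specific model and API.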