MARVIS: Modality Adaptive Reasoning over VISualizations

Abstract

Scientific applications of machine learning often rely on small, specialized models tuned to particular domains. Such models can achieve excellent performance but lack flexibility. Foundation models offer versatility but typically underperform specialized approaches, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables even small vision-language models (VLMs) to make predictions on any data modality with high accuracy. MARVIS transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to interpret and use them. Using a single 3B-parameter model, MARVIS achieves competitive performance on vision, audio, biological, and tabular domains, beating Gemini by 16% on average and approaching specialized methods, without exposing personally identifiable information (PII) or requiring any domain-specific training. We open-source our code and datasets at https://github.com/penfever/marvis
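As a rough illustration of the idea of turning a latent embedding space into a visual representation, the sketch below projects synthetic embeddings to 2D and rasterizes them into a small image in which the query point is highlighted. Everything in it is an assumption made for illustration: the abstract does not specify the projection, the rendering, or the prompt, so PCA, the `project_2d` and `render_scatter` helpers, the color palette, and the prompt string are hypothetical and are not taken from the MARVIS codebase. The VLM call itself is omitted.

```python
import numpy as np

def project_2d(embeddings):
    """Project embeddings to 2D with PCA via SVD (an assumed stand-in projection)."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def render_scatter(points, labels, query_index, size=64):
    """Rasterize 2D points into an RGB image; the query point is drawn in white."""
    # Hypothetical class colors; none of them is pure white.
    palette = np.array([[230, 60, 60], [60, 120, 230], [60, 200, 90]], dtype=np.uint8)
    image = np.zeros((size, size, 3), dtype=np.uint8)
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    pixels = ((points - lo) / span * (size - 1)).round().astype(int)
    for (x, y), label in zip(pixels, labels):
        image[size - 1 - y, x] = palette[label % len(palette)]
    qx, qy = pixels[query_index]
    image[size - 1 - qy, qx] = 255  # drawn last so the query stays visible
    return image

rng = np.random.default_rng(0)
# Synthetic "latent embeddings": two labeled clusters in 16-D plus one query point.
class_a = rng.normal(loc=-2.0, size=(20, 16))
class_b = rng.normal(loc=2.0, size=(20, 16))
query = rng.normal(loc=2.0, size=(1, 16))
embeddings = np.vstack([class_a, class_b, query])
labels = np.array([0] * 20 + [1] * 20 + [0])  # the query's label is a placeholder

projected = project_2d(embeddings)
image = render_scatter(projected, labels, query_index=len(embeddings) - 1)

# Hypothetical prompt that would accompany the image when querying a VLM.
prompt = "The white point is the query. Which colored cluster does it belong to?"
```

In this sketch only the rendered image and the prompt would be passed to a model, not the raw feature values, which is one way to read the abstract's remark about not exposing PII.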