MARVIS: Modality Adaptive Reasoning over VISualizations

Abstract

Scientific applications of machine learning often rely on small, specialized models tuned to particular domains. Such models often perform very well but lack flexibility. Foundation models offer versatility but typically underperform specialized approaches, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables even small vision-language models (VLMs) to make accurate predictions on any data modality. MARVIS transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to interpret and use them. Using a single 3B-parameter model, MARVIS achieves competitive performance on vision, audio, biological, and tabular domains, beating Gemini by 16% on average and approaching specialized methods, without exposing personally identifiable information (PII) or requiring any domain-specific training. We open-source our code and datasets at https://github.com/penfever/marvis
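
The core idea of turning a latent embedding space into a visual representation can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say which projection MARVIS uses, so plain PCA stands in for it here, and the function names (`project_to_2d`, `render_scatter`, `build_example`), the synthetic 16-dimensional embeddings, and the color palette are all hypothetical. The step of sending the rendered image to a VLM is not shown.

```python
import numpy as np


def project_to_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project (n, d) embeddings onto their top two principal components.

    PCA is an illustrative stand-in, not necessarily the projection MARVIS uses.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def render_scatter(points: np.ndarray, labels: np.ndarray,
                   query_index: int, size: int = 64) -> np.ndarray:
    """Rasterize 2D points into a (size, size, 3) uint8 RGB image.

    Labeled points are colored by class; the query point is drawn in white.
    """
    # Hypothetical palette: one RGB color per class, reused cyclically.
    palette = np.array([[220, 50, 50], [50, 120, 220], [60, 180, 75]],
                       dtype=np.uint8)
    image = np.zeros((size, size, 3), dtype=np.uint8)

    # Scale coordinates into pixel indices, guarding against zero range.
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    pixels = np.round((points - lo) / span * (size - 1)).astype(int)

    for i, (x, y) in enumerate(pixels):
        if i == query_index:
            continue
        # Flip the vertical axis so larger y values appear higher in the image.
        image[size - 1 - y, x] = palette[labels[i] % len(palette)]

    # Draw the query last so it is never hidden by a labeled point.
    qx, qy = pixels[query_index]
    image[size - 1 - qy, qx] = (255, 255, 255)
    return image


def build_example(seed: int = 0):
    """Build synthetic embeddings for two classes plus one unlabeled query."""
    rng = np.random.default_rng(seed)
    class_a = rng.normal(loc=-2.0, scale=0.5, size=(20, 16))
    class_b = rng.normal(loc=2.0, scale=0.5, size=(20, 16))
    query = rng.normal(loc=2.0, scale=0.5, size=(1, 16))  # drawn near class 1

    embeddings = np.vstack([class_a, class_b, query])
    labels = np.array([0] * 20 + [1] * 20 + [-1])  # -1 marks the query
    points = project_to_2d(embeddings)
    image = render_scatter(points, labels, query_index=40)
    return points, labels, image
```

In this sketch, `build_example()` returns the projected points, the labels, and an RGB array in which the query appears as a white pixel among the class-colored points. Under the approach the abstract describes, an image of this kind is what a VLM would be asked to interpret, for example by judging which colored cluster the query lies in.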