Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data
February 19, 2025
Authors: Yucheng Shi, Quanzheng Li, Jin Sun, Xiang Li, Ninghao Liu
cs.AI
Abstract
Large multimodal models (LMMs) have shown impressive capabilities in a wide
range of visual tasks. However, they often struggle with fine-grained visual
reasoning, failing to identify domain-specific objects and provide
justifiable explanations for their predictions. To address this, we propose a
novel visual rejection sampling framework to improve the cognition and
explainability of LMMs using self-synthesized data. Specifically, visual
fine-tuning requires images, queries, and target answers. Our approach begins
by synthesizing interpretable answers that include human-verifiable visual
features. These features are grounded in expert-defined concepts,
carefully selected according to their alignment with the image content. After each round of
fine-tuning, we apply a reward model-free filtering mechanism to select the
highest-quality interpretable answers for the next round of tuning. This
iterative process of data synthesis and fine-tuning progressively improves the
model's ability to generate accurate and reasonable explanations. Experimental
results demonstrate the effectiveness of our method in improving both
accuracy and explainability on specialized visual classification tasks.
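
The abstract describes selecting expert-defined concepts according to their alignment with the image content but does not specify the scoring mechanism. The sketch below uses an off-the-shelf CLIP-style image-text similarity (via `open_clip`) as one plausible stand-in; `select_aligned_concepts`, the backbone choice, and `top_k` are all illustrative assumptions, not the authors' implementation.

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP-style encoder (illustrative choice of backbone and weights).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def select_aligned_concepts(image: Image.Image, concepts: list[str],
                            top_k: int = 5) -> list[str]:
    """Rank expert-defined concept phrases by image-text similarity and
    keep the top-k as human-verifiable visual features for the answer."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0))
        txt = model.encode_text(tokenizer(concepts))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        scores = (img @ txt.T).squeeze(0)  # cosine similarity per concept
    ranked = scores.argsort(descending=True)[:top_k]
    return [concepts[int(i)] for i in ranked]
```

For a fine-grained bird-classification task, for example, `concepts` might be phrases like "red crown patch" or "long hooked beak"; the selected subset is then woven into the synthesized, human-verifiable explanation.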
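The iterative synthesize-filter-fine-tune loop can likewise be sketched at a high level. Everything below is hypothetical: `lmm.generate`, `lmm.log_likelihood`, and the injected `fine_tune` callback are placeholders, and the reward-model-free filter shown (keep label-consistent candidates, then break ties with the model's own likelihood) is only one plausible reading of the abstract's filtering step.

```python
def synthesize_answers(lmm, image, query, concepts, n_samples=8):
    """Sample several candidate answers that justify the prediction in
    terms of the selected, human-verifiable concepts."""
    prompt = (f"{query}\nJustify your answer using these visual features: "
              + ", ".join(concepts))
    return [lmm.generate(image, prompt, temperature=0.7)
            for _ in range(n_samples)]

def reward_free_filter(lmm, image, query, candidates, label):
    """Reward-model-free selection: keep answers consistent with the
    ground-truth label, then pick the one the model itself scores highest."""
    consistent = [a for a in candidates if label.lower() in a.lower()]
    if not consistent:
        return None
    return max(consistent, key=lambda a: lmm.log_likelihood(image, query, a))

def iterative_tuning(lmm, dataset, fine_tune, rounds=3):
    """Alternate data synthesis and fine-tuning; each round tunes on the
    filtered, highest-quality interpretable answers from the previous one."""
    for _ in range(rounds):
        tuning_set = []
        for image, query, label, concepts in dataset:
            candidates = synthesize_answers(lmm, image, query, concepts)
            best = reward_free_filter(lmm, image, query, candidates, label)
            if best is not None:
                tuning_set.append((image, query, best))
        lmm = fine_tune(lmm, tuning_set)  # e.g., a LoRA instruction-tuning step
    return lmm
```

Scoring candidates with the model's own likelihood rather than a learned reward model keeps the loop self-contained, which matches the "reward model-free" framing; the paper's exact filtering criterion may differ.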