Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

Abstract

Large multimodal models (LMMs) have shown impressive capabilities across a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning: they fail to identify domain-specific objectives or to provide justifiable explanations for their predictions. To address this, we propose a novel visual rejection sampling framework that improves the cognition and explainability of LMMs using self-synthesized data. Visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are drawn from expert-defined concepts, selected according to how well they align with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of data synthesis and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate that our method improves both accuracy and explainability on specialized visual classification tasks.
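
The iterative process described in the abstract (synthesize answers, filter them without a reward model, fine-tune, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the filtering criterion, so the function names (`self_synthesis_loop`, `synthesize`, `score`, `fine_tune`), the `min_score` threshold, the data layout, and the toy concept-counting score below are all assumptions made for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple

Sample = Tuple[str, str]        # (image id, class label); hypothetical layout
Example = Tuple[str, str, str]  # (image id, class label, selected answer)


def self_synthesis_loop(
    model: Any,
    data: Sequence[Sample],
    synthesize: Callable[[Any, str, str], List[str]],
    score: Callable[[str, str, str], float],
    fine_tune: Callable[[Any, List[Example]], Any],
    rounds: int = 3,
    min_score: float = 1.0,
) -> Any:
    """Alternate answer synthesis, filtering, and fine-tuning (all hypothetical hooks)."""
    for _ in range(rounds):
        selected: List[Example] = []
        for image, label in data:
            # The current model proposes candidate interpretable answers.
            candidates = synthesize(model, image, label)
            if not candidates:
                continue
            # Filter without a learned reward model: keep the best-scoring
            # candidate, and only if it clears the (assumed) threshold.
            best = max(candidates, key=lambda a: score(image, label, a))
            if score(image, label, best) >= min_score:
                selected.append((image, label, best))
        if not selected:
            break  # nothing passed the filter, so there is nothing to tune on
        model = fine_tune(model, selected)
    return model


# Toy stand-ins so the sketch runs end to end; none of this is real model code.
CONCEPTS = {"img1": ["red crown", "white wing bars"]}  # hypothetical expert concepts


def toy_synthesize(model: Any, image: str, label: str) -> List[str]:
    feats = CONCEPTS[image]
    return [
        f"It is a {label}.",
        f"It is a {label} because of its {feats[0]} and {feats[1]}.",
    ]


def toy_score(image: str, label: str, answer: str) -> float:
    # Assumed heuristic: the answer must name the label, and it earns one
    # point per expert concept it mentions.
    if label not in answer:
        return 0.0
    return float(sum(concept in answer for concept in CONCEPTS[image]))


def toy_fine_tune(model: List[Any], examples: List[Example]) -> List[Any]:
    # The "model" here is just a log of the examples used in each round.
    return model + [examples]


if __name__ == "__main__":
    history = self_synthesis_loop(
        [], [("img1", "woodpecker")], toy_synthesize, toy_score, toy_fine_tune, rounds=2
    )
    print(history)
```

In this sketch the filter is a plain scoring function rather than a trained reward model, which is one possible reading of "reward model-free"; the actual selection rule used in the paper may differ.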
