Beyond the Linear Separability Ceiling

Abstract

Most state-of-the-art Visual-Language Models (VLMs) appear to be limited by the linear separability of their visual embeddings on abstract reasoning tasks. This work investigates this "linear reasoning bottleneck" by introducing the Linear Separability Ceiling (LSC), defined as the performance of a simple linear classifier on a VLM's visual embeddings. We find that this bottleneck is widespread and stems not from poor perception but from failures in the language model's reasoning pathways. We demonstrate that this is a solvable alignment issue. The required intervention, however, is task-dependent: activating existing pathways suffices for semantic concepts, while complex relational reasoning requires adapting core model weights. Using postfix tuning as a methodological control, we find strong evidence for powerful, dormant reasoning pathways within VLMs. However, for complex relational tasks requiring deeper adaptation, explicitly improving representation quality causes the model to fail on new prompt formats even though its embeddings remain well separated. Ultimately, this work provides a new lens for VLM analysis, showing that robust reasoning is a matter of targeted alignment, not simply improved representation learning.
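To make the LSC definition concrete, the following is a minimal sketch of a linear probe: a logistic-regression classifier trained on frozen embeddings, with its held-out accuracy taken as the ceiling. It is an illustration under stated assumptions, not the procedure used in this work. The embeddings are synthetic Gaussian clusters standing in for a VLM's visual embeddings, and the names `make_toy_embeddings`, `train_linear_probe`, and `probe_accuracy` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_embeddings(n, dim, rng):
    """ASSUMPTION: synthetic stand-in for frozen VLM visual embeddings.

    Class 1 is centred at +1 in every dimension, class 0 at -1.
    """
    X, y = [], []
    for i in range(n):
        label = i % 2
        mu = 1.0 if label == 1 else -1.0
        X.append([rng.gauss(mu, 1.0) for _ in range(dim)])
        y.append(label)
    return X, y


def train_linear_probe(X, y, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = sigmoid(z) - yi  # gradient of the log loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples on the correct side of the linear boundary."""
    correct = 0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        correct += int((z > 0) == (yi == 1))
    return correct / len(X)


rng = random.Random(0)
X_train, y_train = make_toy_embeddings(200, 8, rng)
X_test, y_test = make_toy_embeddings(100, 8, rng)

w, b = train_linear_probe(X_train, y_train)
lsc = probe_accuracy(w, b, X_test, y_test)  # held-out probe accuracy = LSC
print(f"Linear Separability Ceiling (toy data): {lsc:.3f}")
```

In an actual analysis, the toy embeddings would be replaced by embeddings extracted from the VLM, and the resulting LSC would be compared against the model's own end-to-end accuracy on the same task.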
