GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

Abstract


We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that composes independently trained domain experts into a single unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures that the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that makes inference complexity constant (O(1)) in the number of integrated experts. Experimental results on multi-domain adaptation and class-incremental learning show that GeoStack significantly mitigates catastrophic forgetting while providing an efficient mechanism for long-term knowledge composition. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.
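The abstract does not spell out how weight folding works, so the following is only a hedged illustration of why folding can make inference cost independent of the number of experts. It assumes, purely for the sake of the example, that each expert is an additive low-rank (LoRA-style) adapter ΔW_i = B_i A_i on one linear layer. That parametrization is an assumption, not necessarily GeoStack's, and the sketch ignores the geometric constraints on the adapter manifold entirely. All names (`fold`, `forward_unfolded`, `forward_folded`, the `(B, A)` expert tuples) are hypothetical. Evaluating every adapter separately costs work linear in K. Folding all K deltas into the base weight once, offline, leaves a single matrix-vector product per input.

```python
# Sketch under an ASSUMED additive low-rank adapter model; this is not
# GeoStack's actual parametrization or constraint set.
import itertools
import random
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]
Expert = Tuple[Matrix, Matrix]  # hypothetical (B, A) factors, delta = B @ A


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Matrix-vector product m @ x."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in m]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def forward_unfolded(w: Matrix, experts: List[Expert], x: Vector) -> Vector:
    """Evaluate every adapter separately: cost grows linearly with K."""
    y = matvec(w, x)
    for b, a in experts:
        delta = matvec(b, matvec(a, x))  # B_i (A_i x)
        y = [u + v for u, v in zip(y, delta)]
    return y


def fold(w: Matrix, experts: List[Expert]) -> Matrix:
    """One-time offline merge W' = W + sum_i B_i A_i."""
    folded = w
    for b, a in experts:
        folded = add(folded, matmul(b, a))
    return folded


def forward_folded(w_folded: Matrix, x: Vector) -> Vector:
    """A single matvec with the folded weight: same cost for any K."""
    return matvec(w_folded, x)


rng = random.Random(0)
d_out, d_in, rank, k = 4, 6, 2, 3
w = rand_matrix(d_out, d_in, rng)
experts = [(rand_matrix(d_out, rank, rng), rand_matrix(rank, d_in, rng)) for _ in range(k)]
x = [rng.uniform(-1.0, 1.0) for _ in range(d_in)]

y_unfolded = forward_unfolded(w, experts, x)
w_folded = fold(w, experts)
y_folded = forward_folded(w_folded, x)

# Additive composition is commutative: every ordering of the experts folds
# to the same weight, up to floating-point rounding.
max_order_gap = max(
    abs(p - q)
    for perm in itertools.permutations(experts)
    for row_p, row_q in zip(fold(w, list(perm)), w_folded)
    for p, q in zip(row_p, row_q)
)
print(max(abs(u - v) for u, v in zip(y_unfolded, y_folded)), max_order_gap)
```

In this simplified additive setting the composition is exactly order-independent. The paper's "quasi-Abelian" notion may be weaker or defined differently, and the folding step itself still costs O(K), paid once before inference.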
