No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models

Abstract

Embedding vision-language models (VLMs) are typically pretrained with short text windows (<77 tokens), which forces long-format captions to be truncated. Yet the distribution of biomedical captions from large-scale open-source literature shows that a large portion of captions far exceed 77 tokens. To address this mismatch, we investigate the impact of pretraining on long-format biomedical captions by extending the context length of the text encoders in VLMs. We find that longer context, and thus the additional supervision provided in long-format captions, correlates with better retrieval and classification performance. Given this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M image-caption pairs enriched with context-aware descriptions from full-text articles, providing longer and additional textual supervision. Using BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM with a text encoder supporting windows of up to 512 tokens. Our model extends context capacity by 6.6x, reducing token waste from 55% to just 2.2%. On long-caption retrieval benchmarks, BMC-LongCLIP achieves up to +30% absolute gains in Recall@1 and +2% average improvements in classification, while also converging faster than short-context models. Our results demonstrate that long-context modeling is a promising direction for advancing biomedical VLMs.
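To make the token-waste figures concrete, the following is a minimal sketch of how such a statistic could be computed. It assumes that "token waste" means the fraction of all caption tokens discarded when each caption is truncated to the context window; the abstract does not spell out the definition, so this is an assumption. The function name `token_waste` and the caption lengths are hypothetical and purely illustrative; they are not taken from the paper or its dataset.

```python
def token_waste(caption_lengths, context_window):
    """Fraction of caption tokens dropped by truncation.

    Assumed definition (not confirmed by the abstract): tokens beyond
    `context_window` in each caption count as wasted, and the result is
    the wasted total divided by the total number of caption tokens.
    """
    total = sum(caption_lengths)
    if total == 0:
        return 0.0
    wasted = sum(max(0, n - context_window) for n in caption_lengths)
    return wasted / total


# Hypothetical caption lengths in tokens (illustrative only).
lengths = [40, 77, 150, 300, 600]

short_waste = token_waste(lengths, 77)   # short, CLIP-style window
long_waste = token_waste(lengths, 512)   # extended window

print(f"waste at 77 tokens:  {short_waste:.1%}")
print(f"waste at 512 tokens: {long_waste:.1%}")
print(f"context ratio: {512 / 77:.2f}x")
```

Under this definition, widening the window can only lower the waste, and the ratio 512/77 is consistent with the roughly 6.6x context extension reported above.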
