No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models
October 4, 2025
Authors: Min Woo Sun, Alejandro Lozano, Javier Gamazo Tejero, Vishwesh Nath, Xiao Xiao Sun, James Burgess, Yuhui Zhang, Kun Yuan, Robert Tibshirani, Sean Huver, Serena Yeung-Levy
cs.AI
Abstract
Embedding vision-language models (VLMs) are typically pretrained with short
text windows (<77 tokens), which forces the truncation of long-format captions.
Yet the distribution of biomedical captions drawn from large-scale open-source
literature shows that a substantial portion far exceeds 77 tokens. Motivated by
this, we investigate the impact of pretraining on long-format biomedical
captions by extending the context length of the text encoders in VLMs. We find that
longer context (and thus the additional supervision provided by long-format
captions) correlates with better retrieval and classification performance.
Given this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M
image-caption pairs enriched with context-aware descriptions drawn from full-text
articles, providing longer and richer textual supervision. Using
BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM with a
text encoder supporting windows of up to 512 tokens. Our model extends context
capacity by 6.6x, reducing token waste from 55% to just 2.2%. On long-caption
retrieval benchmarks, BMC-LongCLIP achieves up to +30% absolute gains in
Recall@1 and +2% average improvements in classification, while also converging
faster than short-context baselines. Our results demonstrate that long-context modeling
is a promising direction for advancing biomedical VLMs.
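As a rough illustration of the token-waste figures above (note that 512 / 77 ≈ 6.6, matching the stated 6.6x capacity extension), the following is a minimal sketch, not code from the paper: it measures the share of caption tokens that fall outside a given context window and would be lost to truncation. The tokenizer checkpoint and captions are placeholders, not the BIOMEDICA-LongCAP data or the BMC-LongCLIP tokenizer.

```python
# Minimal sketch (not from the paper): estimate "token waste", i.e. the fraction
# of caption tokens that exceed a text encoder's context window and are truncated.
from transformers import AutoTokenizer

def token_waste(captions, context_len, tokenizer):
    """Fraction of caption tokens discarded when truncating to `context_len`."""
    total = wasted = 0
    for cap in captions:
        n = len(tokenizer(cap, add_special_tokens=True)["input_ids"])
        total += n
        wasted += max(0, n - context_len)
    return wasted / total if total else 0.0

tok = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")  # standard CLIP tokenizer
captions = [
    "Figure 2. Axial contrast-enhanced CT of the chest showing ...",    # placeholder captions
    "Immunohistochemical staining of tumor tissue demonstrating ...",
]
print(f"waste at  77-token window (CLIP default):   {token_waste(captions,  77, tok):.1%}")
print(f"waste at 512-token window (long-context):   {token_waste(captions, 512, tok):.1%}")
```

On a corpus of long-format biomedical captions, this kind of measurement is what the abstract summarizes as token waste dropping from 55% at a 77-token window to 2.2% at 512 tokens.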