No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models
October 4, 2025
Authors: Min Woo Sun, Alejandro Lozano, Javier Gamazo Tejero, Vishwesh Nath, Xiao Xiao Sun, James Burgess, Yuhui Zhang, Kun Yuan, Robert Tibshirani, Sean Huver, Serena Yeung-Levy
cs.AI
Abstract
Embedding vision-language models (VLMs) are typically pretrained with short
text windows (<77 tokens), which forces the truncation of long-format captions.
Yet the distribution of biomedical captions drawn from large-scale open-source
literature shows that a substantial portion far exceeds 77 tokens. Motivated by
this, we investigate the impact of pretraining on long-format biomedical
captions by extending the context length of the text encoders in VLMs. We find that
longer context (and thus the additional supervision provided by long-format
captions) correlates with better retrieval and classification performance.
Given this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M
image-caption pairs enriched with context-aware descriptions drawn from full-text
articles, providing longer and richer textual supervision. Using
BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM with a
text encoder supporting windows of up to 512 tokens. Our model extends context
capacity by 6.6x, reducing token waste from 55% to just 2.2%. On long-caption
retrieval benchmarks, BMC-LongCLIP achieves up to +30% absolute gains in
Recall@1 and +2% average improvements in classification, while also converging
faster than short-context baselines. Our results demonstrate that long-context modeling
is a promising direction for advancing biomedical VLMs.
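As a rough illustration of the token-waste figures above (note that 512 / 77 ≈ 6.6, matching the stated 6.6x capacity extension), the following is a minimal sketch, not code from the paper: it measures the share of caption tokens that fall outside a given context window and would be lost to truncation. The tokenizer checkpoint and captions are placeholders, not the BIOMEDICA-LongCAP data or the BMC-LongCLIP tokenizer.

```python
# Minimal sketch (not from the paper): estimate "token waste", i.e. the fraction
# of caption tokens that exceed a text encoder's context window and are truncated.
from transformers import AutoTokenizer

def token_waste(captions, context_len, tokenizer):
    """Fraction of caption tokens discarded when truncating to `context_len`."""
    total = wasted = 0
    for cap in captions:
        n = len(tokenizer(cap, add_special_tokens=True)["input_ids"])
        total += n
        wasted += max(0, n - context_len)
    return wasted / total if total else 0.0

tok = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")  # standard CLIP tokenizer
captions = [
    "Figure 2. Axial contrast-enhanced CT of the chest showing ...",    # placeholder captions
    "Immunohistochemical staining of tumor tissue demonstrating ...",
]
print(f"waste at  77-token window (CLIP default):   {token_waste(captions,  77, tok):.1%}")
print(f"waste at 512-token window (long-context):   {token_waste(captions, 512, tok):.1%}")
```

On a corpus of long-format biomedical captions, this kind of measurement is what the abstract summarizes as token waste dropping from 55% at a 77-token window to 2.2% at 512 tokens.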