

No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models

October 4, 2025
Authors: Min Woo Sun, Alejandro Lozano, Javier Gamazo Tejero, Vishwesh Nath, Xiao Xiao Sun, James Burgess, Yuhui Zhang, Kun Yuan, Robert Tibshirani, Sean Huver, Serena Yeung-Levy
cs.AI

Abstract

Embedding vision-language models (VLMs) are typically pretrained with short text windows (<77 tokens), which forces the truncation of long-format captions. Yet the distribution of biomedical captions drawn from large-scale open-source literature reveals that a substantial portion of captions far exceed 77 tokens. Motivated by this, we investigate the impact of pretraining on long-format biomedical captions by extending the context length of text encoders in VLMs. We find that longer context, and thus the additional supervision provided by long-format captions, correlates with better retrieval and classification performance. Building on this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M image-caption pairs enriched with context-aware descriptions from full-text articles, providing longer and more informative textual supervision. Using BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM whose text encoder supports windows of up to 512 tokens. Our model extends context capacity by 6.6x, reducing token waste from 55% to just 2.2%. On long-caption retrieval benchmarks, BMC-LongCLIP achieves absolute gains of up to +30% in Recall@1 and average improvements of +2% in classification, while also converging faster than short-context baselines. Our results demonstrate that long-context modeling is a promising direction for advancing biomedical VLMs.
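The abstract quantifies "token waste" as the share of caption tokens lost to truncation at a given context length (55% at a 77-token window vs. 2.2% at 512). Below is a minimal sketch of that measurement, assuming a Hugging Face CLIP tokenizer and illustrative captions; it is not the authors' code, and the BIOMEDICA captions themselves are not reproduced here.

```python
from transformers import AutoTokenizer

# Hypothetical biomedical-style captions for illustration only;
# in the paper, captions come from the BIOMEDICA corpus.
captions = [
    "Figure 3. Immunohistochemical staining of tumor tissue showing strong nuclear positivity.",
    "Axial contrast-enhanced CT image demonstrating a hypodense lesion in the left hepatic lobe.",
]

def token_waste(captions, context_len, tokenizer):
    """Fraction of caption tokens dropped when inputs are truncated to context_len."""
    total = wasted = 0
    for cap in captions:
        n = len(tokenizer(cap)["input_ids"])
        total += n
        wasted += max(0, n - context_len)
    return wasted / max(total, 1)

tok = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(f"waste @ 77 tokens:  {token_waste(captions, 77, tok):.1%}")
print(f"waste @ 512 tokens: {token_waste(captions, 512, tok):.1%}")
```

Run over a full caption corpus, the same statistic reproduces the kind of 55%-to-2.2% reduction the paper reports when the window grows from 77 to 512 tokens.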
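The abstract does not specify how BMC-LongCLIP's text encoder is extended from 77 to 512 positions. One common approach for CLIP-style models (used, for example, by Long-CLIP) is to interpolate the pretrained learned positional embeddings to the new length before continued pretraining. The sketch below illustrates that idea only; the function name, tensor shapes, and the choice of plain linear interpolation are assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embeddings(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    """Linearly interpolate learned positional embeddings [old_len, dim]
    to [new_len, dim] so the text encoder accepts longer inputs."""
    # Reshape to [1, dim, old_len] for 1-D interpolation along positions.
    x = pos_emb.T.unsqueeze(0)
    x = F.interpolate(x, size=new_len, mode="linear", align_corners=True)
    return x.squeeze(0).T

# Example: stretch a CLIP-like 77-position table to a 512-token window.
old = torch.randn(77, 512)   # hypothetical [positions, embed_dim] table
new = stretch_positional_embeddings(old, 512)
print(new.shape)             # torch.Size([512, 512])
```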