BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
April 2, 2026
Authors: Nicolas Boizard, Théo Deschamps-Berger, Hippolyte Gisserot-Boukhlef, Céline Hudelot, Pierre Colombo
cs.AI
Abstract
Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
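The "linear weight merging" mentioned in the dual strategy can be sketched as a simple parameter-wise interpolation between two checkpoints. The function name, the flat list-of-floats representation, and the interpolation weight `alpha` below are illustrative assumptions, not details from the paper:

```python
def merge_linear(state_a, state_b, alpha=0.5):
    """Interpolate two model state dicts parameter-by-parameter:
    merged[name] = alpha * a + (1 - alpha) * b.

    Both dicts must share the same parameter names and shapes.
    Parameters are represented as flat lists of floats for simplicity;
    in practice these would be tensors.
    """
    assert state_a.keys() == state_b.keys(), "checkpoints must share parameter names"
    return {
        name: [alpha * wa + (1.0 - alpha) * wb
               for wa, wb in zip(state_a[name], state_b[name])]
        for name in state_a
    }

# Example: averaging an adapted encoder with the original causal model
# would correspond to alpha = 0.5.
merged = merge_linear({"w": [0.0, 2.0]}, {"w": [2.0, 0.0]}, alpha=0.5)
```

In the paper's setting, one operand would be the adapted bidirectional encoder and the other the original (or a specialized) causal checkpoint, which is how modality- and domain-specific capabilities are transferred without the original pre-training data.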