BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
April 2, 2026
Authors: Nicolas Boizard, Théo Deschamps-Berger, Hippolyte Gisserot-Boukhlef, Céline Hudelot, Pierre Colombo
cs.AI
Abstract
Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
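The linear weight merging mentioned in the abstract can be illustrated as a parameter-wise interpolation between two models sharing the same architecture (e.g., the adapted bidirectional encoder and the original causal model). The following is a minimal sketch only; the function name, the `alpha` coefficient, and the use of plain Python dicts standing in for model state dicts are illustrative assumptions, not the paper's exact recipe.

```python
def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Parameter-wise linear merge of two model state dicts.

    Returns merged[k] = alpha * sd_a[k] + (1 - alpha) * sd_b[k]
    for every parameter name k. Both models must share an
    architecture, i.e., identical parameter names and shapes.
    (Illustrative sketch; plain floats stand in for tensors.)
    """
    assert sd_a.keys() == sd_b.keys(), "models must share an architecture"
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}


# Example: interpolate halfway between two parameter sets.
merged = merge_state_dicts({"w": 1.0, "b": 0.0},
                           {"w": 3.0, "b": 2.0},
                           alpha=0.5)
```

With real checkpoints, the same element-wise interpolation would be applied to each weight tensor; `alpha` trades off how much of each model's behavior the merged network retains.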