BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs

Abstract

Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
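The abstract names linear weight merging but does not give its formula. As a minimal sketch, assuming the merge is a plain convex interpolation between two checkpoints with identical parameter names and shapes, it could look like the following. The function name `linear_merge`, the coefficient `alpha`, and the flat-list representation of weights are illustrative assumptions, not the authors' implementation.

```python
def linear_merge(state_a, state_b, alpha=0.5):
    """Interpolate two checkpoints parameter by parameter.

    Assumption: each checkpoint is a dict mapping a parameter name to a
    flat list of floats, and both checkpoints share the same architecture.
    alpha is the weight given to state_a; (1 - alpha) goes to state_b.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged = {}
    for name, weights_a in state_a.items():
        weights_b = state_b[name]
        if len(weights_a) != len(weights_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two weight vectors.
        merged[name] = [
            alpha * a + (1.0 - alpha) * b
            for a, b in zip(weights_a, weights_b)
        ]
    return merged


# Hypothetical toy checkpoints standing in for an adapted encoder and
# the original causal model it was derived from.
adapted = {"layer.weight": [1.0, 3.0]}
original = {"layer.weight": [3.0, 1.0]}
merged = linear_merge(adapted, original, alpha=0.5)
```

With `alpha=1.0` the result equals the first checkpoint and with `alpha=0.0` the second, so in this sketch the coefficient directly controls how much of the original model's behavior is retained.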
