BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs

Abstract

Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
