C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

December 24, 2025
Authors: Jin Qin, Zihan Liao, Ziyin Zhang, Hang Yu, Peng Di, Rui Wang
cs.AI

Abstract

We present C2LLM (Contrastive Code Large Language Models), a family of code embedding models in 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module to generate sequence embeddings from token embeddings, which 1) utilizes the LLM's causal representations acquired during pretraining, 2) aggregates information from all tokens in the sequence, breaking the information bottleneck of EOS-based sequence embeddings, and 3) supports flexible adaptation of the embedding dimension, serving as an alternative to MRL. Trained on three million publicly available samples, the C2LLM models set new records on MTEB-Code among models of similar size, with C2LLM-7B ranking 1st on the overall leaderboard.
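
To make the pooling mechanism concrete, below is a minimal PyTorch sketch of Pooling by Multihead Attention: a single learnable query attends over all token embeddings from the backbone, and a linear projection sets the output embedding dimension (the property offered as an alternative to MRL). The class name, hidden sizes, and hyperparameters are illustrative assumptions and are not taken from the C2LLM report.

```python
import torch
import torch.nn as nn

class PMAPooling(nn.Module):
    """Sketch of Pooling by Multihead Attention (PMA) for sequence embeddings.

    A learnable "seed" query cross-attends over every token embedding, so the
    pooled vector can draw on the whole sequence rather than a single EOS token.
    Dimensions here are placeholders, not values from the C2LLM report.
    """

    def __init__(self, hidden_dim: int, embed_dim: int, num_heads: int = 8):
        super().__init__()
        # One learnable query vector; using more seeds would yield multiple pooled vectors.
        self.query = nn.Parameter(torch.randn(1, 1, hidden_dim))
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Projection decouples the output embedding size from the backbone hidden size.
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, token_embeds, padding_mask=None):
        # token_embeds: (batch, seq_len, hidden_dim) hidden states from the LLM backbone.
        # padding_mask: (batch, seq_len) bool tensor, True where a position is padding.
        batch = token_embeds.size(0)
        q = self.query.expand(batch, -1, -1)                      # (batch, 1, hidden_dim)
        pooled, _ = self.attn(q, token_embeds, token_embeds,
                              key_padding_mask=padding_mask)      # (batch, 1, hidden_dim)
        seq_embed = self.proj(pooled.squeeze(1))                  # (batch, embed_dim)
        return nn.functional.normalize(seq_embed, dim=-1)

# Usage sketch (backbone call is hypothetical):
# hidden = backbone(input_ids).last_hidden_state   # (batch, seq_len, hidden_dim)
# emb = PMAPooling(hidden_dim=896, embed_dim=512)(hidden, padding_mask)
```

Because the pooled vector is produced by cross-attention rather than by reading out the final EOS hidden state, every token can contribute to the sequence embedding, which is the information-bottleneck argument made in the abstract.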