InCoder-32B: Code Foundation Model for Industrial Scenarios
March 17, 2026
Authors: Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Shawn Guo, Haowen Wang, Weicheng Gu, Yaxin Du, Joseph Li, Fanglin Xu, Yizhi Li, Lin Jing, Yuanbo Wang, Yuhan Gao, Ruihao Gong, Chuan Hao, Ran Tao, Aishan Liu, Tuney Zheng, Ganqu Cui, Zhoujun Li, Mingjie Tang, Chenghua Lin, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv
cs.AI
Abstract
Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. Built on an efficient architecture, InCoder-32B is trained from scratch through four stages: general code pre-training; curated industrial code annealing; mid-training that progressively extends the context window from 8K to 128K tokens with synthetic industrial reasoning data; and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show that InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.
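The abstract does not detail how execution-grounded verification works; a common realization, sketched below under our own assumptions (the function name and binary pass/fail reward are illustrative, not from the paper), is to run each generated candidate against its tests in a sandboxed subprocess and use the exit status as the reward signal:

```python
import subprocess
import sys
import tempfile

def execution_verify(candidate_code: str, test_code: str,
                     timeout: float = 5.0) -> bool:
    """Hypothetical execution-grounded check: run a generated solution
    together with its tests in a fresh interpreter process; a zero exit
    code (all assertions pass) yields a positive reward."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,  # suppress candidate's stdout/stderr
            timeout=timeout,      # guard against non-terminating code
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# A correct candidate passes verification; a buggy one does not.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\n"
print(execution_verify(good, tests))  # True
print(execution_verify(bad, tests))   # False
```

In a post-training loop, this binary signal (or a finer-grained test pass rate) would typically serve as the reward for reinforcement learning or as a filter for rejection sampling.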