Efficient and Scalable Provenance Tracking for Code Snippets Generated by Large Language Models

Abstract

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, but they must compare each code fragment against the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with HYBRIDSOURCETRACKER (HST), a hybrid two-stage provenance-tracking pipeline. HST first narrows the search to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. For windows of 60 tokens or more, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
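
The two-stage idea in the abstract (vector search to shortlist candidates, then Winnowing to re-rank them) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: `embed` is a hashed character-trigram vector standing in for the SOURCETRACKER encoder, the brute-force cosine scan stands in for an approximate nearest-neighbor index (which is where sublinear query time would come from), and the parameters `k`, `w`, and `top_k` are arbitrary.

```python
import hashlib
import math
import re


def tokens(code):
    # Crude lexer (assumption): identifiers, numbers, single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def stable_hash(text):
    # Deterministic across runs, unlike the built-in hash() on strings.
    return int.from_bytes(hashlib.blake2b(text.encode(), digest_size=8).digest(), "big")


def winnow(code, k=4, w=4):
    """Winnowing: hash every k-gram of tokens, keep the minimum of each window of w hashes."""
    toks = tokens(code)
    hashes = [stable_hash(" ".join(toks[i:i + k])) for i in range(len(toks) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def embed(code, dim=256):
    # Hypothetical stand-in for a learned encoder: L2-normalized hashed trigram counts.
    vec = [0.0] * dim
    for i in range(len(code) - 2):
        vec[stable_hash(code[i:i + 3]) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    # Inputs are already unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def hybrid_search(query, corpus, top_k=3):
    # Stage 1: shortlist by vector similarity (linear scan here, for brevity only).
    q_vec = embed(query)
    shortlist = sorted(range(len(corpus)),
                       key=lambda i: cosine(q_vec, embed(corpus[i])),
                       reverse=True)[:top_k]
    # Stage 2: re-rank the shortlist by fingerprint overlap; ties keep stage-1 order.
    q_fp = winnow(query)
    return sorted(shortlist, key=lambda i: jaccard(q_fp, winnow(corpus[i])), reverse=True)


def mean_reciprocal_rank(rankings, truths):
    # Average of 1/rank of the true source; a query scores 0 if the source is missing.
    total = 0.0
    for ranked, truth in zip(rankings, truths):
        total += 1.0 / (ranked.index(truth) + 1) if truth in ranked else 0.0
    return total / len(rankings)


corpus = [
    "def add(a, b):\n    return a + b",
    "for i in range(10):\n    print(i * 2)",
    "def read_lines(path):\n    with open(path) as handle:\n        return [line.strip() for line in handle]",
    "class Stack:\n    def __init__(self):\n        self.items = []",
]
# Adapted query: a fragment of corpus[2] with the identifier `handle` renamed to `fh`.
query = "with open(path) as fh:\n        return [line.strip() for line in fh]"

if __name__ == "__main__":
    print(hybrid_search(query, corpus))
```

In this toy setup the renamed identifier changes only the k-grams that contain it, so the unchanged token runs still yield shared fingerprints for the re-ranking stage. In a real deployment the corpus embeddings and fingerprints would be precomputed and indexed instead of being recomputed per query as they are here.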