Efficient and Scalable Provenance Tracking for Code Snippets Generated by Large Language Models

Abstract

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, but they must compare each code fragment against the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with HYBRIDSOURCETRACKER (HST), a hybrid two-stage provenance-tracking pipeline. HST first uses vector search to narrow the corpus down to a small set of candidate snippets, then re-ranks those candidates with Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, using both verbatim snippets and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank (MRR) on par with Winnowing for 30-token fragments. For windows of 60 tokens or more, it consistently outperforms Winnowing, by up to 5.4%, while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
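
The two-stage idea described in the abstract (vector search to shortlist candidates, then Winnowing-based re-ranking) can be illustrated with a minimal sketch. This is not the authors' implementation. The bag-of-tokens `embed` function is a hypothetical stand-in for the SOURCETRACKER encoder, the brute-force cosine scan stands in for an approximate nearest-neighbor index (the component that would provide sublinear query time in practice), and the parameters `k`, `w`, and `top_k`, the whitespace tokenizer, and the Jaccard scoring of fingerprint sets are all assumptions chosen for illustration. A small `mean_reciprocal_rank` helper shows how the MRR metric is computed.

```python
import hashlib
import math
from collections import Counter


def kgram_hashes(tokens, k):
    """Hash every k-gram (k consecutive tokens) of the sequence to a 32-bit integer."""
    return [
        int(hashlib.md5(" ".join(tokens[i:i + k]).encode()).hexdigest()[:8], 16)
        for i in range(len(tokens) - k + 1)
    ]


def winnow(tokens, k=3, w=4):
    """Simplified Winnowing: keep the minimum hash in each window of w k-gram hashes.

    Positions are not recorded; only the set of selected hash values is returned.
    """
    hashes = kgram_hashes(tokens, k)
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def embed(tokens):
    """Hypothetical stand-in for a learned encoder: a bag-of-tokens count vector."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard overlap between two fingerprint sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_search(query, corpus, top_k=3):
    """Return corpus indices: shortlist by vector similarity, re-rank by fingerprints."""
    q_tokens = query.split()
    q_vec = embed(q_tokens)
    # Stage 1: vector search (brute force here; an ANN index would replace this scan).
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: cosine(q_vec, embed(corpus[i].split())),
        reverse=True,
    )[:top_k]
    # Stage 2: re-rank only the shortlisted candidates by fingerprint overlap.
    q_fp = winnow(q_tokens)
    return sorted(
        candidates,
        key=lambda i: jaccard(q_fp, winnow(corpus[i].split())),
        reverse=True,
    )


def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the gold item per query (0 when the gold item is absent)."""
    total = 0.0
    for ranked, target in zip(rankings, gold):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(rankings)
```

In this sketch the expensive fingerprint comparison touches only `top_k` candidates per query instead of the whole corpus, which is the property the hybrid design relies on; the quality of the final ranking then depends on whether the first stage keeps the true source in the shortlist.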