Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Abstract

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprinting-based plagiarism detectors, such as Winnowing, remain highly effective, but they must compare each code fragment against the entire training set, and this linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with HYBRIDSOURCETRACKER (HST), a hybrid two-stage provenance-tracking pipeline. HST first narrows the search to a small set of candidate snippets via vector search, then re-ranks those candidates with Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, using both verbatim snippets and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank (MRR) on par with Winnowing for 30-token fragments. For windows of 60 tokens or more, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful to end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
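
To make the two-stage idea concrete, the following is a minimal sketch of a retrieve-then-re-rank pipeline. It is an illustration under stated assumptions, not the HST implementation: the hashed character-trigram `embed` function stands in for the SOURCETRACKER encoder, `vector_search` is a brute-force (linear-time) stand-in for the nearest-neighbor index that a real system would need for sub-linear queries, and the parameters `k`, `w`, `dim`, and `top_k`, along with all function names, are hypothetical. The `winnow` function keeps only the selected hash values and drops the positional information that the original Winnowing algorithm records.

```python
import math
import re
import zlib


def tokenize(code):
    # Crude lexer (assumption): identifiers, integers, or single symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def winnow(tokens, k=4, w=3):
    """Return a set of Winnowing-style fingerprints for a token list.

    Hash every k-gram of tokens, slide a window of w hashes over the
    sequence, and keep the minimum hash of each window.
    """
    grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    hashes = [zlib.crc32(g.encode("utf-8")) for g in grams]
    if not hashes:
        return set()
    picked = set()
    for i in range(max(1, len(hashes) - w + 1)):
        picked.add(min(hashes[i:i + w]))
    return picked


def embed(code, dim=256):
    # Toy stand-in for a learned encoder: hashed character trigrams,
    # L2-normalized so that a dot product equals cosine similarity.
    vec = [0.0] * dim
    text = " ".join(tokenize(code))
    for i in range(len(text) - 2):
        vec[zlib.crc32(text[i:i + 3].encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def vector_search(query_vec, corpus_vecs, top_k):
    # Stage 1 (brute force here): indices of the top_k most similar vectors.
    scored = [
        (sum(a * b for a, b in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(corpus_vecs)
    ]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]


def containment(query_fp, candidate_fp):
    # Fraction of the query's fingerprints that also occur in the candidate.
    return len(query_fp & candidate_fp) / len(query_fp) if query_fp else 0.0


def hybrid_search(query, corpus, top_k=3):
    # In practice the corpus vectors and fingerprints are precomputed offline.
    corpus_vecs = [embed(snippet) for snippet in corpus]
    candidates = vector_search(embed(query), corpus_vecs, top_k)
    # Stage 2: re-rank only the shortlisted candidates by fingerprint overlap.
    query_fp = winnow(tokenize(query))
    return sorted(
        candidates,
        key=lambda i: containment(query_fp, winnow(tokenize(corpus[i]))),
        reverse=True,
    )


if __name__ == "__main__":
    corpus = [
        "def add(a, b):\n    return a + b\n",
        "def read_lines(path):\n    with open(path) as f:\n"
        "        return [line.strip() for line in f]\n",
        "for i in range(10):\n    print(i * i)\n",
    ]
    query = "with open(path) as f:\n    return [line.strip() for line in f]"
    print(hybrid_search(query, corpus, top_k=2))
```

The point of the split is that the expensive exact comparison touches only `top_k` candidates rather than the whole corpus, so the overall query cost is governed by the first-stage index. Because tokens are hashed verbatim, this sketch does nothing to handle renamed identifiers in the fingerprint stage.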