Efficient and Scalable Provenance Tracking of LLM-Generated Code Snippets

Abstract

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors such as Winnowing remain highly effective, but checking a code fragment requires comparing it against the entire training set, and this linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with HYBRIDSOURCETRACKER (HST), a hybrid two-stage provenance-tracking pipeline. HST first narrows the search to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, using both verbatim snippets and adapted snippets that emulate realistic identifier renaming. In an in vitro experiment over a 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank (MRR) on par with Winnowing for 30-token fragments. For windows of 60 tokens or more, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
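To make the two-stage idea concrete, the following is a minimal sketch of a retrieve-then-re-rank loop, not the HST implementation. Everything in it is an assumption made for illustration: a hashed bag-of-tokens vector stands in for the SOURCETRACKER encoder, a brute-force cosine scan stands in for an approximate nearest-neighbour index (so this toy version is linear-time, unlike the logarithmic-time search described above), and the Winnowing step is simplified to a set of per-window minimum k-gram hashes. The function names (`tokenize`, `embed`, `winnow`, `hybrid_search`, `mean_reciprocal_rank`) and the parameters `k`, `w`, `dim`, and `top_k` are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter


def _h32(text):
    """Stable 32-bit hash of a string (md5 used only for determinism)."""
    return int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")


def tokenize(code):
    """Crude lexer: identifiers, integers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def embed(tokens, dim=64):
    """Toy stand-in for a learned encoder: L2-normalised hashed bag of tokens."""
    vec = [0.0] * dim
    for tok, count in Counter(tokens).items():
        vec[_h32(tok) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def winnow(tokens, k=3, w=4):
    """Simplified Winnowing: keep the minimum k-gram hash of each window of w hashes."""
    hashes = [_h32(" ".join(tokens[i:i + k])) for i in range(len(tokens) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def hybrid_search(query, corpus, top_k=3):
    """Stage 1: vector search for top_k candidates. Stage 2: fingerprint re-ranking."""
    q_tokens = tokenize(query)
    q_vec, q_fp = embed(q_tokens), winnow(q_tokens)

    def cosine(i):
        c_vec = embed(tokenize(corpus[i]))
        return sum(a * b for a, b in zip(q_vec, c_vec))

    # Brute-force scan here; a real system would query a vector index instead.
    candidates = sorted(range(len(corpus)), key=cosine, reverse=True)[:top_k]

    def containment(i):
        # Fraction of the query's fingerprints that also occur in the candidate.
        return len(q_fp & winnow(tokenize(corpus[i]))) / len(q_fp) if q_fp else 0.0

    # sorted() is stable, so ties keep the stage-1 (vector search) order.
    return sorted(candidates, key=containment, reverse=True)


def mean_reciprocal_rank(rankings, truths):
    """MRR: average of 1/rank of the true source (contributes 0 if it is absent)."""
    total = 0.0
    for ranking, truth in zip(rankings, truths):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(truths)


corpus = [
    "def add(a, b):\n    return a + b",
    "for i in range(10):\n    print(i)",
    "def mean(xs):\n    return sum(xs) / len(xs)",
    "with open(path) as f:\n    data = f.read()",
]
ranking = hybrid_search(corpus[2], corpus)  # verbatim query for snippet 2
print(ranking, mean_reciprocal_rank([ranking], [2]))
```

The split reflects the division of labour described above: the cheap vector stage bounds how many snippets the exact fingerprint comparison has to touch, and the fingerprint stage decides the final order among those candidates.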