The Limits of LLM Adaptability: How Model-Internalized Priors Affect Annotation Task Performance

Abstract

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with the data and the task definition affects performance, (2) the extent to which additional information in the prompt can correct zero-shot errors ("decision stickiness"), and (3) how susceptible models are to misaligned task definitions. Through toxicity detection experiments across diverse datasets (social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors resist correction: the overall rescue rate (the fraction of initial errors corrected by prompting) is only 34.8%. High-confidence errors are especially resistant. When given misaligned definitions, LLMs follow them while keeping their confidence at the same level as in the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures the alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF is positively associated with model performance (partial r = +0.41), whereas three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings expose the limits of prompt-based correction in annotation tasks and highlight the importance of definition alignment over text-level memorization.
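
To make the rescue rate concrete, the following is a minimal sketch that follows the definition given in the abstract (the fraction of initial zero-shot errors that the prompted run corrects). The function name `rescue_rate`, the argument names, and the toy labels are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def rescue_rate(
    gold: Sequence[int],
    zero_shot: Sequence[int],
    prompted: Sequence[int],
) -> float:
    """Fraction of zero-shot errors that the prompted prediction corrects.

    Hypothetical helper: only items that were wrong in the zero-shot run
    enter the denominator, so new errors introduced by the prompt are
    deliberately ignored here.
    """
    if not (len(gold) == len(zero_shot) == len(prompted)):
        raise ValueError("all label sequences must have the same length")

    # Indices where the zero-shot prediction disagrees with the gold label.
    initial_errors = [i for i, (g, z) in enumerate(zip(gold, zero_shot)) if g != z]
    if not initial_errors:
        return float("nan")  # undefined when there is nothing to rescue

    # An error counts as rescued if the prompted prediction matches gold.
    rescued = sum(1 for i in initial_errors if prompted[i] == gold[i])
    return rescued / len(initial_errors)


# Toy labels (1 = toxic, 0 = non-toxic); values are made up for illustration.
gold = [1, 0, 1, 1, 0, 0]
zero_shot = [0, 0, 0, 1, 1, 0]  # wrong at indices 0, 2, 4
prompted = [1, 0, 0, 1, 1, 1]   # index 0 rescued; 2 and 4 stay wrong
print(f"rescue rate: {rescue_rate(gold, zero_shot, prompted):.3f}")
```

In this toy case one of three initial errors is corrected. The error newly introduced at index 5 does not affect the result, because the metric as defined conditions only on initial errors.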