ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

Abstract

Existing memory benchmarks for LLM agents evaluate explicit recall of facts but overlook implicit memory, in which experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus-Unconditioned Stimulus (CS-US) associations shaping first decisions). Our 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, and the top performers, DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%), all fall far below human baselines. Analysis uncovers dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling. ImplicitMemBench reframes evaluation from "what agents recall" to "what they automatically enact".
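To make the Learning/Priming-Interfere-Test protocol and first-attempt scoring concrete, the following is a minimal Python sketch of how one episode might be run and scored. It is an illustration under stated assumptions, not the benchmark's actual harness: the `Item` fields, the `run_item` and `accuracy` helpers, and the per-item `is_correct` checker are all hypothetical names, and a "model" is assumed to be any callable that maps the conversation so far to a reply.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a model is any callable from the message history to a reply.
Model = Callable[[List[str]], str]


@dataclass
class Item:
    """Hypothetical layout of one benchmark item (not the paper's schema)."""
    learning: List[str]                # turns that expose the rule or association
    interference: List[str]            # unrelated distractor turns
    probe: str                         # test prompt; it never restates the rule
    is_correct: Callable[[str], bool]  # hypothetical per-item checker


def run_item(model: Model, item: Item) -> bool:
    """Run one Learning -> Interfere -> Test episode and score the first reply only."""
    history: List[str] = []
    for turn in item.learning + item.interference:
        history.append(turn)
        history.append(model(history))  # keep the model's own replies in context
    history.append(item.probe)
    first_reply = model(history)        # first attempt: no retries, no hints
    return item.is_correct(first_reply)


def accuracy(model: Model, items: List[Item]) -> float:
    """Fraction of items whose first test-phase reply passes the checker."""
    return sum(run_item(model, it) for it in items) / len(items)
```

Scoring only the first test-phase reply is the point of the design: a model that needs a reminder or a second try has recalled the rule rather than enacted it automatically.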
