The Coverage Principle: A Framework for Understanding Compositional Generalization

Abstract

Large language models excel at pattern matching, yet often fall short in systematic compositional generalization. We propose the coverage principle: a data-centric framework showing that models relying primarily on pattern matching for compositional tasks cannot reliably generalize beyond substituting fragments that yield identical results when used in the same contexts. We demonstrate that this framework has strong predictive power for the generalization capabilities of Transformers. First, we derive and empirically confirm that the training data required for two-hop generalization grows at least quadratically with the token set size, and that training data efficiency does not improve with 20x parameter scaling. Second, for compositional tasks with path ambiguity, where one variable affects the output through multiple computational paths, we show that Transformers learn context-dependent state representations that undermine both performance and interoperability. Third, Chain-of-Thought supervision improves training data efficiency for multi-hop tasks but still struggles with path ambiguity. Finally, we outline a mechanism-based taxonomy that distinguishes three ways neural networks can generalize: structure-based (bounded by coverage), property-based (leveraging algebraic invariances), and shared-operator (through function reuse). This conceptual lens contextualizes our results and highlights where new architectural ideas are needed to achieve systematic compositionality. Overall, the coverage principle provides a unified lens for understanding compositional reasoning, and underscores the need for fundamental architectural or training innovations to achieve truly systematic compositionality.
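The idea of "substituting fragments that yield identical results when used in the same contexts" can be made concrete with a toy two-hop task, where an input is a triple (x1, x2, x3), the fragment is (x1, x2), and the context is x3. The sketch below is an illustration only, not the paper's formal definition: the function name `covered_inputs`, the hand-built training set, and the rule that a single shared context with matching outputs is enough to treat two fragments as equivalent are all assumptions made for this example.

```python
def covered_inputs(train):
    """Return the inputs reachable from `train` by fragment substitution.

    `train` maps (x1, x2, x3) -> output. Two fragments (x1, x2) are treated
    as observed-equivalent when they share at least one context x3 in the
    training data and agree on every shared context (an assumed, simplified
    criterion for this toy example).
    """
    # Group observations by fragment: fragment -> {context: output}
    by_fragment = {}
    for (x1, x2, x3), y in train.items():
        by_fragment.setdefault((x1, x2), {})[x3] = y

    fragments = list(by_fragment)
    parent = {f: f for f in fragments}

    def find(f):
        # Union-find lookup with path halving
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    # Merge fragments that agree on all shared contexts
    for i, a in enumerate(fragments):
        for b in fragments[i + 1:]:
            shared = by_fragment[a].keys() & by_fragment[b].keys()
            if shared and all(by_fragment[a][c] == by_fragment[b][c] for c in shared):
                parent[find(a)] = find(b)

    # Collect every context seen with any member of each equivalence class
    contexts = {}
    for f in fragments:
        contexts.setdefault(find(f), set()).update(by_fragment[f])

    # An input is covered if its fragment's class was seen with its context
    return {(x1, x2, c) for (x1, x2) in fragments for c in contexts[find((x1, x2))]}


# Hypothetical training set: fragments (0, 0) and (1, 1) agree in context 0,
# while (2, 2) disagrees with (0, 0) in context 1.
train = {
    (0, 0, 0): 5,
    (0, 0, 1): 6,
    (1, 1, 0): 5,
    (2, 2, 1): 7,
}
covered = covered_inputs(train)
print((1, 1, 1) in covered)  # unseen, but reachable by substituting (0, 0) -> (1, 1)
print((2, 2, 0) in covered)  # unseen, and no equivalent fragment was seen in context 0
```

In this toy setting, (1, 1, 1) falls inside coverage because the data shows (1, 1) behaving like (0, 0), while (2, 2, 0) falls outside it because nothing in the data links (2, 2) to a fragment observed with context 0. Under this reading, a pattern-matching learner would have evidence for the first prediction but not the second.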
