How Well Does Generative Recommendation Generalize?

Abstract

A widely held hypothesis for why generative recommendation (GR) models outperform conventional item ID-based models is that they generalize better. However, there are few systematic ways to verify this hypothesis beyond a superficial comparison of overall performance. To address this gap, we categorize each data instance by the specific capability required for a correct prediction: either memorization (reusing item transition patterns observed during training) or generalization (composing known patterns to predict unseen item transitions). Extensive experiments show that GR models perform better on instances that require generalization, whereas item ID-based models perform better when memorization is more important. To explain this divergence, we shift the analysis from the item level to the token level and show that, for GR models, what appears to be item-level generalization often reduces to token-level memorization. Finally, we show that the two paradigms are complementary. We propose a simple memorization-aware indicator that adaptively combines them on a per-instance basis, leading to improved overall recommendation performance.
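
The instance categorization described above can be illustrated with a minimal sketch. The abstract does not give the exact criterion, so this example assumes the simplest possible one: an instance counts as memorization if the transition from the last item in the history to the target item appears as an adjacent pair in some training sequence, and as generalization otherwise. The function names `observed_transitions` and `categorize` and the toy data are hypothetical, not taken from the paper.

```python
from typing import Iterable, List, Set, Tuple

Transition = Tuple[str, str]


def observed_transitions(train_sequences: Iterable[List[str]]) -> Set[Transition]:
    """Collect every adjacent (previous item, next item) pair seen in training."""
    seen: Set[Transition] = set()
    for seq in train_sequences:
        seen.update(zip(seq, seq[1:]))
    return seen


def categorize(history: List[str], target: str, seen: Set[Transition]) -> str:
    """Label a test instance by whether its final transition was seen in training.

    Assumption: only the first-order transition (last history item -> target)
    is checked; the paper's actual definition may differ.
    """
    if not history:
        return "generalization"
    return "memorization" if (history[-1], target) in seen else "generalization"


# Toy training sequences of item IDs (hypothetical data).
train = [["a", "b", "c"], ["b", "d"]]
seen = observed_transitions(train)

labels = [
    categorize(["a", "b"], "c", seen),  # transition b -> c occurs in training
    categorize(["a", "c"], "d", seen),  # transition c -> d never occurs in training
]
```

Under this assumed criterion, per-category accuracy could be reported separately for each model family, which is the kind of comparison the abstract describes.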
