TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning for Enhancing LLMs' Social Intelligence

Abstract

Recently, Large Language Models (LLMs) have made significant progress in IQ-related domains that require careful thinking, such as mathematics and coding. However, enhancing LLMs' cognitive development in social domains, particularly from a post-training perspective, remains underexplored. Unlike mathematics, which relies primarily on System 2 cognition (careful, step-by-step reasoning), the social world follows a distinct timeline and calls for a richer blend of cognitive modes, ranging from intuitive reactions (System 1) and surface-level thinking to deliberate thinking (System 2). Building on this observation, we introduce Temporal-aware Hierarchical Cognitive Reinforcement Learning (TimeHC-RL) for enhancing LLMs' social intelligence. In our experiments, we systematically explore ways to improve LLMs' social intelligence and validate the effectiveness of TimeHC-RL alongside five other post-training paradigms and two test-time intervention paradigms on eight datasets with diverse data patterns. Experimental results show that TimeHC-RL outperforms the widely adopted System 2 RL method. The method enables the 7B backbone model to rival the performance of advanced models such as DeepSeek-R1 and OpenAI-O3. In addition, our systematic exploration of post-training and test-time interventions for improving LLMs' social intelligence has uncovered several valuable insights.