Large Content and Behavior Models to Understand, Simulate, and Optimize Content and Behavior

Abstract

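In his seminal paper introducing information theory, Shannon divided communication into three levels: technical, semantic, and effectiveness. The technical level is concerned with the accurate reconstruction of transmitted symbols, whereas the semantic and effectiveness levels deal with the inferred meaning and its effect on the receiver. Thanks to telecommunications, work on the first-level problem has produced great advances such as the internet. Large Language Models (LLMs) have made some progress toward the second goal, but the third level remains largely untouched. The third problem deals with predicting receiver behavior and optimizing communication to obtain the desired behavior. LLMs show broad generalization across a wide range of tasks, yet they are unable to solve this problem. One reason for this underperformance could be a lack of "behavior tokens" in LLMs' training corpora. Behavior tokens define receiver behavior over a communication, such as shares, likes, clicks, purchases, and retweets. When data is preprocessed for LLM training, behavior tokens are often treated as noise and removed from the corpora. In this paper, we therefore make initial progress toward reintroducing behavior tokens in LLM training. In addition to performing similarly to LLMs on content understanding tasks, the trained models show generalization capabilities on behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. We show results for all of these capabilities using a wide range of tasks on two corpora. We call these models Large Content and Behavior Models (LCBMs). Further, to spur more research on LCBMs, we release our new Content Behavior Corpus (CBC), a repository containing communicators, messages, and the corresponding receiver behavior.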

English

Shannon, in his seminal paper introducing information theory, divided communication into three levels: technical, semantic, and effectiveness. While the technical level is concerned with accurate reconstruction of transmitted symbols, the semantic and effectiveness levels deal with the inferred meaning and its effect on the receiver. Thanks to telecommunications, the first-level problem has produced great advances like the internet. Large Language Models (LLMs) make some progress towards the second goal, but the third level still remains largely untouched. The third problem deals with predicting and optimizing communication for desired receiver behavior. LLMs, while showing broad generalization capabilities across a wide range of tasks, are unable to solve this problem. One reason for the underperformance could be a lack of "behavior tokens" in LLMs' training corpora. Behavior tokens define receiver behavior over a communication, such as shares, likes, clicks, purchases, retweets, etc. While preprocessing data for LLM training, behavior tokens are often removed from the corpora as noise. Therefore, in this paper, we make some initial progress towards reintroducing behavior tokens in LLM training. The trained models, other than showing similar performance to LLMs on content understanding tasks, show generalization capabilities on behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. Using a wide range of tasks on two corpora, we show results on all these capabilities. We call these models Large Content and Behavior Models (LCBMs). Further, to spur more research on LCBMs, we release our new Content Behavior Corpus (CBC), a repository containing communicator, message, and corresponding receiver behavior.
