Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
September 1, 2023
作者: Ashmit Khandelwal, Aditya Agrawal, Aanisha Bhattacharyya, Yaman K Singla, Somesh Singh, Uttaran Bhattacharya, Ishita Dasgupta, Stefano Petrangeli, Rajiv Ratn Shah, Changyou Chen, Balaji Krishnamurthy
cs.AI
Abstract
Shannon, in his seminal paper introducing information theory, divided communication into three levels: technical, semantic, and effectiveness. While the technical level is concerned with the accurate reconstruction of transmitted symbols, the semantic and effectiveness levels deal with the inferred meaning and its effect on the receiver. Thanks to telecommunications, work on the first-level problem has produced great advances like the internet. Large Language Models (LLMs) make some progress towards the second goal, but the third level remains largely untouched. The third problem deals with predicting and optimizing communication for desired receiver behavior. LLMs, despite generalizing across a wide range of tasks, are unable to solve this problem. One reason for the underperformance could be the lack of "behavior tokens" in LLMs' training corpora. Behavior tokens record receiver behavior in response to a communication, such as shares, likes, clicks, purchases, and retweets. When preprocessing data for LLM training, behavior tokens are often removed from the corpora as noise. Therefore, in this paper, we make initial progress towards reintroducing behavior tokens in LLM training. Besides performing comparably to LLMs on content understanding tasks, the trained models show generalization capabilities on behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. Using a wide range of tasks on two corpora, we show results on all these capabilities. We call these models Large Content and Behavior Models (LCBMs). Further, to spur more research on LCBMs, we release our new Content Behavior Corpus (CBC), a repository containing communicator, message, and corresponding receiver behavior.
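To make the idea of behavior tokens concrete, here is a minimal, illustrative sketch (not the authors' code) of how receiver behavior could be serialized alongside content for LLM training instead of being stripped as noise. The record fields and the `<communicator>`, `<message>`, and `<behavior>` markers are hypothetical assumptions for illustration, not the actual CBC schema.

```python
# Illustrative sketch only: one possible serialization of a
# (communicator, message, receiver-behavior) record into training text.
# All field names and special tokens below are assumptions.

from dataclasses import dataclass


@dataclass
class CommunicationRecord:
    communicator: str  # who sent the message
    message: str       # the content itself
    likes: int         # receiver behavior: like count
    shares: int        # receiver behavior: share count


def to_training_text(rec: CommunicationRecord) -> str:
    """Keep behavior as explicit tokens next to the content.

    A typical preprocessing pipeline would keep only rec.message;
    retaining the behavior tokens lets the model learn to predict
    (simulate) or condition on (optimize) receiver behavior.
    """
    return (
        f"<communicator> {rec.communicator} "
        f"<message> {rec.message} "
        f"<behavior> likes={rec.likes} shares={rec.shares}"
    )


if __name__ == "__main__":
    rec = CommunicationRecord("@brand", "Launching our new app today!", 1200, 87)
    print(to_training_text(rec))
    # <communicator> @brand <message> Launching our new app today!
    # <behavior> likes=1200 shares=87
```

Under this framing, behavior simulation amounts to generating the `<behavior>` span given the content, while content simulation and optimization condition generation of the `<message>` span on a desired behavior.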