
An Information Theoretic Perspective on Agentic System Design

December 25, 2025
Authors: Shizhe He, Avanika Narayan, Ishan S. Khare, Scott W. Linderman, Christopher Ré, Dan Biderman
cs.AI

Abstract

Agentic language model (LM) systems power modern applications like "Deep Research" and "Claude Code," leveraging multi-LM architectures to overcome context limitations. Beneath their apparent diversity lies a recurring pattern: smaller "compressor" LMs (which can even run locally) distill raw context into compact text that is then consumed by larger "predictor" LMs. Despite their popularity, the design of compressor-predictor systems remains largely ad hoc, with little guidance on how compressor and predictor choices shape downstream performance. In practice, attributing gains to compression versus prediction requires costly, task-specific pairwise sweeps. We argue that these agentic system design questions are, at root, information-theoretic. Viewing the compressor LM as a noisy channel, we introduce a simple estimator of the mutual information between the context and its compression, which quantifies compression quality in a task-independent way. We show that mutual information strongly predicts downstream performance, independent of any specific task. Using this information-theoretic framework, we perform a comprehensive empirical analysis across five datasets and three model families. Results reveal that larger compressors are not only more accurate but also more token-efficient, conveying more bits of information per token. A 7B Qwen-2.5 compressor, for instance, is 1.6× more accurate, 4.6× more concise, and conveys 5.5× more bits of mutual information per token than its 1.5B sibling. Across datasets, scaling compressors is substantially more effective than scaling predictors, enabling larger on-device compressors to pair with smaller cloud predictors. Applied to a Deep Research system, these principles enable local compressors as small as 3B parameters to recover 99% of frontier-LM accuracy at 26% of the API cost.
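The abstract does not spell out the estimator itself. As a rough illustration of how a mutual-information estimate between a context x and its compression z could be computed with off-the-shelf LMs, the sketch below scores the standard pointwise identity i(x; z) = log p(z | x) − log p(z) with a single scoring model. The model name, the `log_prob_of` helper, and the use of an unconditional score as a proxy for the marginal p(z) are all assumptions for illustration, not the paper's method.

```python
# Hedged sketch: a pointwise mutual-information estimate between a context x
# and its compression z, via i(x; z) = log p(z | x) - log p(z). The scoring
# model, helper names, and the unconditional score standing in for the
# marginal p(z) are illustrative assumptions, not the paper's estimator.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B"  # placeholder scoring model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
lm.eval()


@torch.no_grad()
def log_prob_of(target: str, prefix: str = "") -> float:
    """Total log-probability (nats) of `target`, optionally conditioned on `prefix`.

    Tokenizing prefix and target separately can mis-split at the boundary;
    acceptable for a sketch, not for careful measurement.
    """
    target_ids = tok(target, return_tensors="pt").input_ids
    if prefix:
        prefix_ids = tok(prefix, return_tensors="pt").input_ids
        input_ids = torch.cat([prefix_ids, target_ids], dim=1)
        n_prefix = prefix_ids.shape[1]
    else:
        input_ids = target_ids
        n_prefix = 1  # first target token has no conditioning position; skip it
    logits = lm(input_ids).logits
    # Shift: the token at position t is predicted by the logits at t - 1.
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    labels = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Sum log-probs over the target span only (prefix positions excluded).
    return token_lp[:, n_prefix - 1 :].sum().item()


def pmi_bits_per_token(context: str, compression: str) -> float:
    """i(x; z) = log p(z|x) - log p(z), converted to bits per compression token."""
    lp_cond = log_prob_of(compression, prefix=context)  # log p(z | x)
    lp_marg = log_prob_of(compression)                  # marginal proxy log p(z)
    n_tokens = len(tok(compression).input_ids)
    return (lp_cond - lp_marg) / (n_tokens * math.log(2))
```

Normalizing by the compression length mirrors the abstract's bits-per-token comparison across compressor sizes: a compression whose tokens are much more predictable given the context than without it carries more mutual information per token.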