CAVEWOMAN: Behavior of Large Language Models under Linguistic Input and Output Compression

Abstract

"Talk short. Drop grammar. Save token." This caveman style is widely promoted as a way to cut inference cost, but whether it actually saves anything depends on which channel is compressed: the user's prompt or the model's response. We present Cavewoman, a two-channel evaluation protocol that scores every generation on task accuracy, realized per-item cost, and agreement with the model's own unconstrained reference text. We evaluate eight models on five datasets at five reduction levels, with both channels measured on the same items. Output compression cuts realized cost on most API models (1.4-2.4x per model, up to 3x in the best case) and on all four open-weight models under public-tier pricing. Input compression has the opposite effect, a strict lose-lose: it raises net cost rather than lowering it (~1.15x on the five-benchmark mean, up to 1.8x on the worst dataset and 2.7x under stronger compression), because models compensate with longer responses even as accuracy collapses. Under the same setting, surface text diverges from the unconstrained reference: on the non-reasoning models, roughly half of all generations are correct, yet their surface text no longer entails the model's own unconstrained baseline generation. The divergence survives length-controlled re-scoring, multiple-comparisons correction, and replication under complementary semantic measures. Code and data are available at https://github.com/danielle34/cavewoman.
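To make the cost argument concrete, the following is a minimal sketch of how a realized per-item cost and a compressed-to-baseline cost ratio could be computed. It is an illustration under stated assumptions, not the protocol's actual implementation: the `Pricing` class, the function names, the prices, and the token counts are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical API prices in USD per million tokens (illustrative only)."""
    input_per_million: float
    output_per_million: float


def realized_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost in USD of one item: billed input tokens plus billed output tokens."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


def cost_ratio(baseline_cost: float, compressed_cost: float) -> float:
    """Compressed cost divided by baseline cost.

    Values above 1.0 mean compression made the item more expensive;
    values below 1.0 mean it saved money.
    """
    return compressed_cost / baseline_cost


# Made-up numbers: output tokens are priced higher than input tokens.
pricing = Pricing(input_per_million=1.0, output_per_million=4.0)

# Baseline item: full prompt, normal-length response.
baseline = realized_cost(input_tokens=200, output_tokens=100, pricing=pricing)

# Input-compressed item: shorter prompt, but the model answers at greater length.
compressed = realized_cost(input_tokens=120, output_tokens=150, pricing=pricing)

ratio = cost_ratio(baseline, compressed)
print(f"cost ratio (compressed / baseline): {ratio:.2f}")
```

With these invented figures the prompt shrinks by 80 tokens, yet the 50 extra response tokens cost more than the prompt savings, so the ratio comes out above 1.0. That is the mechanism the abstract describes for input compression: when output tokens are priced above input tokens, a modest lengthening of the response can outweigh a larger reduction in the prompt.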