CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

Abstract

"Talk short. Drop grammar. Save token." This caveman style is widely promoted as a way to cut inference cost, but whether it actually saves anything depends on which channel is compressed: the user's prompt or the model's response. We present Cavewoman, a two-channel evaluation protocol that scores every generation on task accuracy, realized per-item cost, and agreement between its surface text and the model's own unconstrained reference generation. We evaluate eight models on five datasets at five compression levels, with both channels measured on the same items. Output compression cuts realized cost on most API models (1.4-2.4x per model, up to 3x in the best case) and on all four open-weight models under public-tier pricing. Input compression has the opposite effect and is a strict lose-lose: it raises net cost rather than lowering it (~1.15x on the five-benchmark mean, up to 1.8x on the worst dataset, and 2.7x under stronger compression), because models compensate with longer responses even as accuracy collapses. Under the same setting, surface text diverges from the unconstrained reference: on the non-reasoning models, roughly half of all generations are correct, yet their surface text no longer entails the model's own unconstrained baseline generation. The divergence survives length-controlled re-scoring, multiple-comparisons correction, and replication under complementary semantic measures. Code and data are available at https://github.com/danielle34/cavewoman.
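
To make the notion of realized per-item cost concrete, the following is a minimal sketch of how such a cost and the resulting cost ratio might be computed from token counts. It is not the released Cavewoman code: the `Usage` record, the function names, the per-million-token prices, and all token counts are hypothetical and chosen only for illustration. The sketch assumes a pricing scheme in which output tokens cost more than input tokens, which is why a shorter prompt that triggers a longer response can raise net cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token counts for one generation (hypothetical record layout)."""
    input_tokens: int
    output_tokens: int


def realized_cost(usage: Usage, price_in: float, price_out: float) -> float:
    """Cost in dollars for one item; prices are given per million tokens."""
    total = usage.input_tokens * price_in + usage.output_tokens * price_out
    return total / 1_000_000


def cost_ratio(baseline: Usage, compressed: Usage,
               price_in: float, price_out: float) -> float:
    """Compressed cost divided by baseline cost.

    Values above 1 mean the compressed setting cost more than the baseline.
    """
    return (realized_cost(compressed, price_in, price_out)
            / realized_cost(baseline, price_in, price_out))


# Illustrative numbers only; none of these are taken from the paper.
PRICE_IN, PRICE_OUT = 1.0, 4.0  # assumed dollars per million tokens

baseline = Usage(input_tokens=200, output_tokens=100)
# Input compression: shorter prompt, but the model replies at greater length.
input_compressed = Usage(input_tokens=100, output_tokens=150)
# Output compression: same prompt, shorter reply.
output_compressed = Usage(input_tokens=200, output_tokens=40)

ratio_in = cost_ratio(baseline, input_compressed, PRICE_IN, PRICE_OUT)
ratio_out = cost_ratio(baseline, output_compressed, PRICE_IN, PRICE_OUT)
print(f"input compression cost ratio:  {ratio_in:.2f}")
print(f"output compression cost ratio: {ratio_out:.2f}")
```

With these assumed numbers, the input-compressed item costs more than the baseline and the output-compressed item costs less. Note that this sketch reports compressed cost over baseline cost, so a saving appears as a ratio below 1, whereas the savings quoted in the abstract are expressed as reduction factors.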