CAVEWOMAN: How Large Language Models Behave Under Compression of Language Input and Output

Abstract

"Talk short. Drop grammar. Save token." This caveman style is widely promoted as a way to cut inference cost, but whether it actually saves anything depends on which channel is compressed: the user's prompt or the model's response. We present Cavewoman, a two-channel evaluation protocol that scores every generation on task accuracy, realized per-item cost, and textual agreement with the model's own unconstrained reference generation. We evaluate eight models on five datasets at five reduction levels, measuring both channels on the same items. Output compression cuts realized cost on most API models (1.4-2.4x per model, up to 3x in the best case) and on all four open-weight models under public-tier pricing. Input compression has the opposite effect, a strict lose-lose: it raises net cost rather than lowering it (~1.15x on the five-benchmark mean, up to 1.8x on the worst dataset, and up to 2.7x under stronger compression), because models compensate with longer responses even as accuracy collapses. Under the same setting, surface text diverges from the unconstrained reference: on the non-reasoning models, roughly half of all generations are correct, yet their surface text no longer entails the model's own unconstrained baseline generation. The divergence survives length-controlled re-scoring, multiple-comparisons correction, and replication under complementary semantic measures. Code and data are available at https://github.com/danielle34/cavewoman.