A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification
January 19, 2026
Authors: Gonzalo Ariel Meyoyan, Luciano Del Corro
cs.AI
Abstract
Production LLM systems often rely on separate models for safety and other classification-heavy steps, increasing latency, VRAM footprint, and operational complexity. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation. We frame classification as representation selection over the full token-layer hidden-state tensor, rather than committing to a fixed token or fixed layer (e.g., first-token logits or final-layer pooling). To implement this, we introduce a two-stage aggregator that (i) summarizes tokens within each layer and (ii) aggregates across layer summaries to form a single representation for classification. We instantiate this template with direct pooling, a 100K-parameter scoring-attention gate, and a downcast multi-head self-attention (MHA) probe with up to 35M trainable parameters. Across safety and sentiment benchmarks, our probes improve over logit-only reuse (e.g., MULI) and are competitive with substantially larger task-specific baselines, while preserving near-serving latency and avoiding the VRAM and latency costs of a separate guard-model pipeline.
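To make the two-stage aggregation concrete, below is a minimal PyTorch sketch of the scoring-attention instantiation: stage (i) pools tokens within each layer into a per-layer summary, and stage (ii) pools the layer summaries into a single vector for a small classification head. The module names, tensor shapes (batch, num_layers, seq_len, hidden_dim), and parameter choices are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ScoringGate(nn.Module):
    """Lightweight scoring attention: one scalar score per element, softmax-weighted sum."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., n, dim) -> weighted sum over the n axis -> (..., dim)
        weights = torch.softmax(self.score(x).squeeze(-1), dim=-1)
        return torch.einsum("...n,...nd->...d", weights, x)


class TwoStageProbe(nn.Module):
    """Two-stage aggregator probe over the token-layer hidden-state tensor:
    (i) summarize tokens within each layer, (ii) aggregate the layer summaries,
    then classify with a small linear head.
    """

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.token_gate = ScoringGate(hidden_dim)  # stage (i): tokens -> per-layer summary
        self.layer_gate = ScoringGate(hidden_dim)  # stage (ii): layer summaries -> one vector
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, num_layers, seq_len, hidden_dim),
        # reused from the serving LLM's forward pass (detached, no extra generation cost).
        layer_summaries = self.token_gate(hidden_states)  # (batch, num_layers, hidden_dim)
        pooled = self.layer_gate(layer_summaries)         # (batch, hidden_dim)
        return self.classifier(pooled)                    # (batch, num_classes) logits


# Usage on hidden states captured during serving (shapes are hypothetical).
probe = TwoStageProbe(hidden_dim=4096, num_classes=2)
dummy_states = torch.randn(1, 33, 128, 4096)  # e.g. embeddings + 32 layers, 128 tokens
logits = probe(dummy_states)
```

Replacing either `ScoringGate` with mean pooling recovers the direct-pooling variant, while swapping in a small multi-head self-attention block over tokens or layers corresponds to the heavier MHA probe described in the abstract.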