Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation
September 18, 2025
Authors: Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng
cs.AI
Abstract
Large language models (LLMs) are increasingly applied in diverse real-world
scenarios, each governed by bespoke behavioral and safety specifications (specs)
custom-tailored by users or organizations. These specs, categorized into
safety-specs and behavioral-specs, vary across scenarios and evolve with changing
preferences and requirements. We formalize this challenge as specification
alignment, focusing on LLMs' ability to follow dynamic, scenario-specific specs
from both behavioral and safety perspectives. To address this challenge, we
propose Align3, a lightweight method that employs Test-Time Deliberation (TTD)
with hierarchical reflection and revision to reason over specification
boundaries. We further present SpecBench, a unified benchmark for measuring
specification alignment, covering 5 scenarios, 103 specs, and 1,500 prompts.
Experiments on 15 reasoning and 18 instruct models with several TTD methods,
including Self-Refine, TPO, and MoreThink, yield three key findings: (i)
test-time deliberation enhances specification alignment; (ii) Align3 advances
the safety-helpfulness trade-off frontier with minimal overhead; (iii)
SpecBench effectively reveals alignment gaps. These results highlight the
potential of test-time deliberation as an effective strategy for reasoning over
real-world specification boundaries.
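The reflect-and-revise loop at the core of test-time deliberation can be sketched as follows. This is a minimal, hypothetical illustration of the general TTD idea, not the paper's Align3 implementation; the `generate` and `critique` callables and the spec list are illustrative stand-ins.

```python
# Hypothetical sketch of test-time deliberation (TTD) via reflect-and-revise.
# `generate` and `critique` are caller-supplied stand-ins for an LLM call and
# a spec checker; they are not the actual Align3 components.

def deliberate(prompt, specs, generate, critique, max_rounds=3):
    """Iteratively revise a draft until it satisfies every spec (or rounds run out)."""
    draft = generate(prompt)
    for _ in range(max_rounds):
        # Check the draft against each scenario-specific specification.
        violations = [s for s in specs if not critique(draft, s)]
        if not violations:
            break  # all specs satisfied: stop deliberating
        # Ask the model to revise with the violated specs in context.
        feedback = "; ".join(violations)
        draft = generate(f"{prompt}\nRevise to satisfy: {feedback}")
    return draft
```

In practice the critique step could itself be an LLM judging the draft against each spec, with the loop bounded to keep the test-time overhead small.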