Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence

Abstract

Recent reasoning models achieve substantially higher accuracy on complex tasks such as mathematical reasoning by producing detailed, comprehensive reasoning traces. However, generating these long reasoning sequences is computationally expensive and slow. To address this inefficiency, we exploit the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning branches exist, we decode multiple tokens per step using a specialized attention mask and process them within a single sequence, which avoids additional memory usage. Experiments show that our method achieves a speedup of over 100% in decoding time while maintaining answer quality.
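The abstract does not spell out how the specialized attention mask is built, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes every token carries a branch label (the `SHARED` constant and the `build_branch_mask` helper are hypothetical names), that branch tokens are interleaved in one sequence because each decoding step appends one token per branch, and that a token may attend only to earlier shared tokens and to earlier tokens of its own branch.

```python
SHARED = -1  # hypothetical label for tokens that belong to the common prefix


def build_branch_mask(branch_ids):
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumed rule (illustrative only): attention is causal, and a key is visible
    if it is a shared token or belongs to the same branch as the query.
    """
    n = len(branch_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only the current and earlier positions
            if branch_ids[j] == SHARED or branch_ids[j] == branch_ids[i]:
                mask[i][j] = True
    return mask


# Two shared prompt tokens, then two branches decoded in lockstep:
# each decoding step appends one token per branch to the same sequence.
branch_ids = [SHARED, SHARED, 0, 1, 0, 1]
mask = build_branch_mask(branch_ids)

# Print the mask as a grid: "x" means visible, "." means masked out.
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

In this sketch the two branches never see each other's tokens, yet both read the same shared prefix positions, so the prefix appears only once in the sequence rather than once per branch.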
