Thinking While Listening: Simple Test Time Scaling For Audio Classification

Abstract

We propose a framework that enables neural models to "think while listening" to everyday sounds, thereby improving audio classification performance. Motivated by recent advances in the reasoning capabilities of large language models, we address two central questions: (i) how can thinking be incorporated into existing audio classification pipelines to enable reasoning in the category space and improve performance, and (ii) can a new architecture be designed from the ground up to support both thinking and test-time scaling? We demonstrate that our models achieve improved classification accuracy in both settings. Leveraging test-time scaling, we observe consistent gains as the number of sampled traces increases. Furthermore, we evaluate two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, and show that although such models are capable of zero-shot reasoning, a lightweight approach that retrains only the embedding matrix of a frozen, smaller model such as GPT-2 can surpass the performance of billion-parameter text-based reasoning models.
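
The abstract does not say how the sampled traces are combined into a final prediction. As a minimal sketch, assuming simple majority voting over the label each trace ends on, test-time scaling could look like the following. All names here (`majority_vote`, `classify_with_traces`, `make_noisy_sampler`, `estimate_accuracy`) are hypothetical, and the noisy sampler is a toy stand-in for a real reasoning model, not the authors' method.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("need at least one predicted label")
    counts = Counter(labels)
    top = max(counts.values())
    for label in labels:  # scan in sampling order for a deterministic tie-break
        if counts[label] == top:
            return label
    raise AssertionError("unreachable")


def classify_with_traces(sample_trace: Callable[[object], str],
                         audio: object, num_traces: int) -> str:
    """Sample `num_traces` independent traces and aggregate their final labels.

    `sample_trace` is an assumed interface: it takes an audio input and
    returns the class label that one sampled reasoning trace ends on.
    """
    predictions = [sample_trace(audio) for _ in range(num_traces)]
    return majority_vote(predictions)


def make_noisy_sampler(true_label: str, labels: List[str],
                       p_correct: float, rng: random.Random):
    """Toy stand-in for a model: correct with probability `p_correct`."""
    wrong = [label for label in labels if label != true_label]

    def sample_trace(_audio: object) -> str:
        if rng.random() < p_correct:
            return true_label
        return rng.choice(wrong)

    return sample_trace


def estimate_accuracy(num_traces: int, trials: int = 500, seed: int = 0) -> float:
    """Fraction of trials in which the aggregated prediction is correct."""
    rng = random.Random(seed)
    labels = ["dog_bark", "siren", "rain", "engine"]
    sampler = make_noisy_sampler("siren", labels, p_correct=0.5, rng=rng)
    hits = sum(classify_with_traces(sampler, None, num_traces) == "siren"
               for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 15):
        print(k, estimate_accuracy(k))
```

In this toy setup each trace is right only half the time, yet the aggregated prediction becomes more reliable as more traces are sampled, which is the qualitative trend the abstract reports. Other aggregation rules, such as weighting traces by model confidence, would fit the same interface.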
