

Thinking While Listening: Simple Test Time Scaling For Audio Classification

September 24, 2025
Authors: Prateek Verma, Mert Pilanci
cs.AI

Abstract

We propose a framework that enables neural models to "think while listening" to everyday sounds, thereby enhancing audio classification performance. Motivated by recent advances in the reasoning capabilities of large language models, we address two central questions: (i) how can thinking be incorporated into existing audio classification pipelines to enable reasoning in the category space and improve performance, and (ii) can a new architecture be designed from the ground up to support both thinking and test-time scaling? We demonstrate that in both settings, our models exhibit improved classification accuracy. Leveraging test-time scaling, we observe consistent gains as the number of sampled traces increases. Furthermore, we evaluate two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, showing that while such models are capable of zero-shot reasoning, a lightweight approach--retraining only the embedding matrix of a frozen, smaller model like GPT-2--can surpass the performance of billion-parameter text-based reasoning models.
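The test-time scaling the abstract describes — sampling multiple reasoning traces and observing gains as the trace count grows — can be illustrated by aggregating stochastic class predictions with a majority vote. The sketch below is a minimal illustration, not the paper's implementation: the softmax sampling, the temperature parameter, and the vote rule are all assumptions.

```python
import math
import random
from collections import Counter

def sample_trace(logits, temperature=1.0, rng=None):
    """Sample one class index from softmax(logits / temperature).

    Stands in for one stochastic 'thinking' trace ending in a
    class decision (hypothetical; the real traces are token sequences).
    """
    rng = rng or random
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return i
    return len(exps) - 1

def classify_with_scaling(logits, n_traces, seed=0):
    """Test-time scaling sketch: sample n traces, majority-vote the labels."""
    rng = random.Random(seed)
    votes = Counter(sample_trace(logits, rng=rng) for _ in range(n_traces))
    return votes.most_common(1)[0][0]
```

With more sampled traces, the vote concentrates on the model's modal prediction, which mirrors the reported trend of accuracy improving with the number of traces.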
PDF · September 26, 2025