
Thinking While Listening: Simple Test Time Scaling For Audio Classification

September 24, 2025
Authors: Prateek Verma, Mert Pilanci
cs.AI

Abstract

We propose a framework that enables neural models to "think while listening" to everyday sounds, thereby enhancing audio classification performance. Motivated by recent advances in the reasoning capabilities of large language models, we address two central questions: (i) how can thinking be incorporated into existing audio classification pipelines to enable reasoning in the category space and improve performance, and (ii) can a new architecture be designed from the ground up to support both thinking and test-time scaling? We demonstrate that in both settings, our models exhibit improved classification accuracy. Leveraging test-time scaling, we observe consistent gains as the number of sampled traces increases. Furthermore, we evaluate two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, showing that while such models are capable of zero-shot reasoning, a lightweight approach--retraining only the embedding matrix of a frozen, smaller model like GPT-2--can surpass the performance of billion-parameter text-based reasoning models.
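Below is a minimal illustrative sketch, not the authors' released code, of the two ingredients the abstract describes: retraining only the embedding matrix of a frozen GPT-2 backbone, and test-time scaling by sampling several traces and aggregating their predictions. The prompt construction, decoding settings, and label-extraction rule are hypothetical placeholders; the sketch assumes a Hugging Face GPT-2 model and an already-tokenized audio-derived prompt.

```python
# Sketch under stated assumptions: (i) freeze GPT-2 and leave only its token-embedding
# matrix trainable; (ii) test-time scaling via majority vote over sampled traces.
from collections import Counter

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Freeze every parameter, then unfreeze only the token-embedding matrix.
for p in model.parameters():
    p.requires_grad = False
model.transformer.wte.weight.requires_grad = True
# GPT-2 ties lm_head to wte, so the output projection is updated through the same tensor.

print([n for n, p in model.named_parameters() if p.requires_grad])
# -> ['transformer.wte.weight']


@torch.no_grad()
def classify_with_test_time_scaling(prompt_ids, class_names, num_traces=8):
    """Sample several traces for one audio-derived prompt and majority-vote the
    predicted category. The decoding and label-matching scheme here is a
    hypothetical stand-in for illustration only."""
    votes = []
    for _ in range(num_traces):
        out = model.generate(
            prompt_ids,
            do_sample=True,
            top_p=0.95,
            max_new_tokens=32,
            pad_token_id=tokenizer.eos_token_id,
        )
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        # Naive label extraction: first class name mentioned in the sampled trace.
        for name in class_names:
            if name in text:
                votes.append(name)
                break
    return Counter(votes).most_common(1)[0][0] if votes else None
```

As in the abstract, increasing `num_traces` is the test-time-scaling knob: more sampled traces give the vote more evidence to aggregate, at proportionally higher inference cost.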