Thinking While Listening: Simple Test Time Scaling For Audio Classification

Abstract

We propose a framework that enables neural models to "think while listening" to everyday sounds, thereby improving audio classification performance. Motivated by recent advances in the reasoning capabilities of large language models, we address two central questions: (i) how can thinking be incorporated into existing audio classification pipelines to enable reasoning in the category space and improve performance, and (ii) can a new architecture be designed from the ground up to support both thinking and test-time scaling? We demonstrate that our models achieve improved classification accuracy in both settings. Leveraging test-time scaling, we observe consistent gains as the number of sampled traces increases. Furthermore, we evaluate two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, and show that while such models are capable of zero-shot reasoning, a lightweight approach that retrains only the embedding matrix of a frozen, smaller model such as GPT-2 can surpass the performance of billion-parameter text-based reasoning models.
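The abstract does not say how sampled traces are combined into a final prediction. As a minimal sketch of test-time scaling, the example below assumes a simple majority vote over the labels produced by independently sampled traces. The names `majority_vote`, `classify_with_traces`, and `noisy_classifier` are hypothetical, and the toy sampler stands in for a real reasoning model; none of this is taken from the paper.

```python
import random
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(labels: Sequence[Hashable]) -> Hashable:
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("need at least one predicted label")
    return Counter(labels).most_common(1)[0][0]


def classify_with_traces(
    sample_trace: Callable[[random.Random], Hashable],
    num_traces: int,
    seed: int = 0,
) -> Hashable:
    """Sample `num_traces` predictions and aggregate them by majority vote.

    ASSUMPTION: majority voting is an illustrative aggregation rule,
    not necessarily the one used in the paper.
    """
    rng = random.Random(seed)
    predictions = [sample_trace(rng) for _ in range(num_traces)]
    return majority_vote(predictions)


def noisy_classifier(rng: random.Random) -> str:
    """Hypothetical stand-in for one sampled reasoning trace.

    Returns the correct label 60% of the time, otherwise a wrong one.
    """
    if rng.random() < 0.6:
        return "dog_bark"
    return rng.choice(["siren", "rain"])


if __name__ == "__main__":
    # More traces give the vote more evidence to average out noisy samples.
    for n in (1, 5, 25, 125):
        print(n, classify_with_traces(noisy_classifier, num_traces=n))
```

Under this assumption, a single trace is wrong whenever the sampler errs, while the aggregated vote becomes more reliable as the number of traces grows, which matches the qualitative trend the abstract reports.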
