アシスタントターンを超えて：言語モデルの対話認識プローブとしてのユーザーターン生成

要旨

標準的なLLMベンチマークは、アシスタントの発話を評価する。モデルが入力に対して応答を生成し、検証者が正確性を採点し、分析はそこで終了する。このパラダイムでは、LLMがアシスタント応答の後に続く内容について何らかの意識を符号化しているかどうかが測定されないままである。我々は、このギャップを探るためのプローブとして、ユーザー発話生成を提案する。ユーザークエリとアシスタント応答からなる会話コンテキストが与えられたとき、モデルにユーザーの役割で生成させる。もしモデルの重みがインタラクションへの意識を符号化しているならば、生成されるユーザー発話は、先行するコンテキストに反応した、接地されたフォローアップとなるはずである。11のオープンウェイトLLM（Qwen3.5, gpt-oss, GLM）と5つのデータセット（数学推論、指示追従、会話）にわたる実験を通じて、インタラクションへの意識とタスクの正確性は分離されていることを示す。特にQwen3.5ファミリーでは、GSM8Kの正解率は41%（0.8B）から96.8%（397B-A17B）までスケールするが、決定的生成下での真のフォローアップ率はほぼゼロのままである。対照的に、より高い温度パラメータによるサンプリングでは、インタラクションへの意識が潜在的に存在し、フォローアップ率が22%に達することが明らかになった。制御された摂動実験により、提案するプローブがモデルの実在する特性を測定していることが検証され、協調指向の事後学習を施したQwen3.5-2Bではフォローアップ率の増加が示された。我々の結果は、ユーザー発話生成が、現在のアシスタントのみを評価するベンチマークでは未探索かつ不可視であるLLMの振る舞いの次元、すなわちインタラクションへの意識を捉えていることを示す。

English

Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model's weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across 11 open-weight LLMs (Qwen3.5, gpt-oss, GLM) and 5 datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from 41% (0.8B) to 96.8% (397B-A17B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher temperature sampling reveals interaction awareness is latent with follow up rates reaching 22%. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B demonstrates an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible with current assistant-only benchmarks.

アシスタントターンを超えて：言語モデルの対話認識プローブとしてのユーザーターン生成

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

要旨

Support