Conditional Hypothesis Generation with Researcher-Specified Covariates in LLM-Based Text Analysis

Abstract

A core goal of computational social science is to discover interpretable differences in how language varies with outcomes of interest, such as political affiliation or instructional quality. Recent LLM-based hypothesis generation methods describe such differences in natural language, but they select globally discriminative patterns without accounting for covariates that researchers know, from domain knowledge, to shape the data. When covariates are ignored, the selected patterns can reflect confounds rather than differences of substantive interest. We introduce conditional hypothesis generation, a framework that incorporates researcher-specified covariates to steer hypothesis discovery toward differences that hold within relevant subgroups. Two challenges arise: the target subgroup may be underrepresented (stratum imbalance), and the direction of a difference may reverse across subgroups (sign reversal). We propose two econometrics-inspired methods: one introduces feature-covariate interactions to detect sign reversals, and the other applies within-stratum demeaning and inverse-frequency reweighting so that underrepresented strata carry equal weight. In synthetic experiments, each method outperforms global baselines in the setting it targets, and expert evaluation on two real-world datasets confirms that covariate-aware generation surfaces more useful hypotheses within relevant subgroups.
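
To make the two challenges concrete, the following is a minimal sketch of within-stratum demeaning, inverse-frequency reweighting, and per-stratum slopes (what a fully interacted feature-covariate regression would estimate) on a toy dataset. It is an illustration under stated assumptions, not the authors' implementation: the function names, the binary feature `x`, the outcome `y`, and the stratum labels are all hypothetical.

```python
import numpy as np

def demean_within(values, strata):
    """Subtract each stratum's mean from the values in that stratum."""
    _, inverse = np.unique(strata, return_inverse=True)
    means = np.bincount(inverse, weights=values) / np.bincount(inverse)
    return values - means[inverse]

def inverse_frequency_weights(strata):
    """Weights proportional to 1 / stratum size, so every stratum has the same total weight."""
    labels, inverse, counts = np.unique(
        strata, return_inverse=True, return_counts=True
    )
    return 1.0 / (len(labels) * counts[inverse])

def within_slope(x, y, strata, weights=None):
    """Weighted slope of y on x after within-stratum demeaning of both."""
    xc = demean_within(x, strata)
    yc = demean_within(y, strata)
    if weights is None:
        weights = np.ones_like(xc)
    return float(np.sum(weights * xc * yc) / np.sum(weights * xc * xc))

def per_stratum_slopes(x, y, strata):
    """One slope per stratum, as a fully interacted model would estimate."""
    slopes = {}
    for s in np.unique(strata):
        mask = strata == s
        slopes[str(s)] = within_slope(x[mask], y[mask], strata[mask])
    return slopes

# Hypothetical toy data: stratum "a" has 8 documents, stratum "b" only 2,
# and the feature-outcome relationship flips sign between them.
strata = np.array(["a"] * 8 + ["b"] * 2)
x = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0], dtype=float)

pooled = within_slope(x, y, strata)
balanced = within_slope(x, y, strata, inverse_frequency_weights(strata))
slopes = per_stratum_slopes(x, y, strata)
print(pooled, balanced, slopes)
```

On this toy data the unweighted within-stratum slope is dominated by the large stratum, the equal-weight version lets the small stratum cancel it out, and only the per-stratum slopes reveal that the direction of the difference reverses.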