Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates

Abstract

A core goal of computational social science is to discover interpretable differences in how language varies with outcomes of interest, such as political affiliation or instructional quality. Recent LLM-based hypothesis generation methods describe such differences in natural language, but they select globally discriminative patterns without accounting for the covariates that researchers specify from domain knowledge. When covariates are ignored, the selected patterns can reflect confounds rather than differences of substantive interest. We introduce conditional hypothesis generation, a framework that incorporates researcher-specified covariates to steer hypothesis discovery toward differences that hold within relevant subgroups. Two challenges arise: the target subgroup may be underrepresented (stratum imbalance), and the direction of a difference may reverse across subgroups (sign reversal). We propose two econometrics-inspired methods: one introduces feature-covariate interaction terms to detect sign reversals, and the other applies within-stratum demeaning and inverse-frequency reweighting to balance underrepresented strata. Synthetic experiments show that each method outperforms global baselines in its targeted setting, and expert evaluation on two real-world datasets confirms that covariate-aware generation surfaces more useful hypotheses within relevant subgroups.
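
The two ideas named above can be illustrated with a small linear-regression sketch. This is not the authors' implementation: the function names, the toy data, and the choice of ordinary least squares are all assumptions made for illustration. It assumes a binary feature `x` (whether a candidate pattern appears in a document), an outcome `y`, and a single categorical covariate `strata`, with stratum "B" deliberately underrepresented and showing the opposite direction of association.

```python
import numpy as np

def inverse_frequency_weights(strata):
    """Per-document weights so that every stratum carries equal total weight."""
    _, inverse, counts = np.unique(np.asarray(strata), return_inverse=True,
                                   return_counts=True)
    w = 1.0 / counts[inverse]
    return w / w.sum()

def demean_within(values, strata):
    """Subtract each stratum's own mean from its members."""
    values = np.asarray(values, dtype=float)
    _, inverse = np.unique(np.asarray(strata), return_inverse=True)
    means = np.bincount(inverse, weights=values) / np.bincount(inverse)
    return values - means[inverse]

def balanced_within_slope(x, y, strata):
    """Weighted slope of y on x after within-stratum demeaning."""
    xd, yd = demean_within(x, strata), demean_within(y, strata)
    w = inverse_frequency_weights(strata)
    return float(np.sum(w * xd * yd) / np.sum(w * xd * xd))

def stratum_slopes(x, y, strata):
    """Fit y ~ stratum + x:stratum; the interaction columns give one slope per stratum."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    levels, inverse = np.unique(np.asarray(strata), return_inverse=True)
    dummies = np.eye(len(levels))[inverse]               # one-hot stratum indicators
    design = np.hstack([dummies, dummies * x[:, None]])  # intercepts, then interactions
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return {str(level): float(c) for level, c in zip(levels, coef[len(levels):])}

# Hypothetical toy data: 8 documents in stratum A, 2 in stratum B.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0], dtype=float)
strata = np.array(["A"] * 8 + ["B"] * 2)

global_slope = float(np.polyfit(x, y, 1)[0])    # pooled fit that ignores the covariate
balanced = balanced_within_slope(x, y, strata)  # each stratum weighted equally
slopes = stratum_slopes(x, y, strata)           # opposite signs indicate a reversal
```

On this toy data the pooled slope is positive (about 0.6) because stratum A dominates, the balanced within-stratum slope is about 0 because the two strata cancel once weighted equally, and the per-stratum slopes are about +1 for A and -1 for B, which is the sign reversal that a global selection criterion would miss.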