Diverse Dictionary Learning

Abstract

Given only observational data X = g(Z), where both the latent variables Z and the generating process g are unknown, recovering Z is ill-posed without additional assumptions. Existing methods often assume linearity or rely on auxiliary supervision and functional constraints. However, such assumptions are rarely verifiable in practice, and most theoretical guarantees break down under even mild violations, leaving it unclear how to reliably understand the hidden world. To make identifiability actionable in real-world scenarios, we take a complementary view: in general settings where full identifiability is unattainable, what can still be recovered with guarantees, and what biases could be universally adopted? We introduce the problem of diverse dictionary learning to formalize this view. Specifically, we show that intersections, complements, and symmetric differences of the latent variables linked to arbitrary observations, along with the latent-to-observed dependency structure, remain identifiable up to appropriate indeterminacies even without strong assumptions. These set-theoretic results can be composed using set algebra to construct structured and essential views of the hidden world, such as genus-differentia definitions. When sufficient structural diversity is present, they further imply full identifiability of all latent variables. Notably, all identifiability benefits follow from a simple inductive bias during estimation that can be readily integrated into most models. We validate the theory and demonstrate the benefits of the bias on both synthetic and real-world data.
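
To make the set-theoretic objects mentioned above concrete, the following is a minimal illustrative sketch, not the method of the paper. It assumes a hypothetical dependency structure in which each observed variable is linked to a set of latent variables; the names `latent_support`, `x1`, `x2`, and `z1` to `z4` are invented for illustration, and "complement" is read here as a relative complement (set difference), which is an assumption.

```python
# Hypothetical latent-to-observed dependency structure: for each observed
# variable, the set of latent variables it depends on. Illustrative only.
latent_support = {
    "x1": {"z1", "z2", "z3"},
    "x2": {"z2", "z3", "z4"},
}

a = latent_support["x1"]
b = latent_support["x2"]

shared = a & b      # intersection: latents linked to both observations
only_a = a - b      # relative complement: latents linked to x1 but not x2
sym_diff = a ^ b    # symmetric difference: latents linked to exactly one

# Composing the pieces with set algebra gives a genus-differentia style
# description of x1: what it shares with x2, plus what sets it apart.
definition_x1 = {"genus": shared, "differentia": only_a}

print(sorted(shared), sorted(only_a), sorted(sym_diff))
```

The sketch only shows how the three set operations relate to one another on a known structure; in the setting described above, the structure itself is unknown and must be estimated.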