Diverse Dictionary Learning

Abstract

Given only observational data X = g(Z), where both the latent variables Z and the generating process g are unknown, recovering Z is ill-posed without additional assumptions. Existing methods often assume linearity or rely on auxiliary supervision and functional constraints. However, such assumptions are rarely verifiable in practice, and most theoretical guarantees break down under even mild violations, leaving it unclear how to reliably understand the hidden world. To make identifiability actionable in real-world scenarios, we take a complementary view: in general settings where full identifiability is unattainable, what can still be recovered with guarantees, and what biases could be universally adopted? We introduce the problem of diverse dictionary learning to formalize this view. Specifically, we show that intersections, complements, and symmetric differences of the latent variables linked to arbitrary observations, along with the latent-to-observed dependency structure, remain identifiable up to appropriate indeterminacies even without strong assumptions. These set-theoretic results can be composed using set algebra to construct structured and essential views of the hidden world, such as genus-differentia definitions. When sufficient structural diversity is present, they further imply full identifiability of all latent variables. Notably, all identifiability benefits follow from a simple inductive bias during estimation that can be readily integrated into most models. We validate the theory and demonstrate the benefits of the bias on both synthetic and real-world data.
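As a rough illustration of the set-theoretic objects named above, the following sketch computes the intersection, complement, and symmetric difference of the latent variables linked to two observations. The binary `mask` matrix and the helper `latents_of` are hypothetical and chosen only for this example; they illustrate the set algebra, not the estimation procedure or the identifiability results themselves.

```python
import numpy as np

# Hypothetical latent-to-observed dependency structure (an assumption for
# illustration): rows are observed variables X_i, columns are latent
# variables Z_j, and mask[i, j] = 1 means X_i depends on Z_j.
mask = np.array([
    [1, 1, 0, 0],  # X_0 depends on Z_0 and Z_1
    [0, 1, 1, 0],  # X_1 depends on Z_1 and Z_2
    [0, 0, 1, 1],  # X_2 depends on Z_2 and Z_3
])


def latents_of(i):
    """Return the set of latent indices linked to observed variable i."""
    return set(np.flatnonzero(mask[i]).tolist())


a, b = latents_of(0), latents_of(1)

shared = a & b   # intersection: latents linked to both X_0 and X_1
only_a = a - b   # complement of b within a: latents linked to X_0 only
either = a ^ b   # symmetric difference: latents linked to exactly one

print(shared, only_a, either)
```

Read loosely, `shared` plays the role of a common part and `only_a` a distinguishing part, which is one way such sets could be composed into a genus-differentia style description.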
