Convergence Diagnostics for Entity Resolution
Annual Review of Statistics and Its Application (IF 7.9), Pub Date: 2024-04-24, DOI: 10.1146/annurev-statistics-040522-114848
Serge Aleshin-Guendel 1 , Rebecca C. Steorts 1, 2, 3

Entity resolution is the process of merging and removing duplicate records from multiple data sources, often in the absence of unique identifiers. Bayesian models for entity resolution allow one to include a priori information, quantify uncertainty in important applications, and directly estimate a partition of the records. Markov chain Monte Carlo (MCMC) sampling is the primary computational method for approximate posterior inference in this setting, but due to the high dimensionality of the space of partitions, there are no agreed-upon standards for diagnosing nonconvergence of MCMC sampling. In this article, we review Bayesian entity resolution, with a focus on the specific challenges that it poses for the convergence of a Markov chain. We review prior methods for convergence diagnostics, discussing their weaknesses. We provide recommendations for using MCMC sampling for Bayesian entity resolution, focusing on the use of modern diagnostics that are commonplace in applied Bayesian statistics. Using simulated data, we find that a commonly used Gibbs sampler performs poorly compared with two alternatives.
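As a rough illustration of what a diagnostic on sampled partitions can look like, the sketch below computes a basic split-R-hat on a scalar summary of each draw. The abstract does not say which diagnostics or summaries the authors recommend, so both the choice of split-R-hat and the summary (number of clusters) are assumptions made here for illustration, and the function names `num_clusters` and `split_rhat` are hypothetical.

```python
from statistics import mean, variance


def num_clusters(labels):
    """Scalar summary of one sampled partition: the count of distinct cluster labels."""
    return len(set(labels))


def split_rhat(chains):
    """Basic split-R-hat for a scalar quantity.

    `chains` is a list of equal-length lists of draws, one list per chain.
    Assumes each chain has at least four draws.
    """
    # Split every chain in half so that drift within a chain
    # also appears as disagreement between (half-)chains.
    halves = []
    for draws in chains:
        half = len(draws) // 2
        halves.append(draws[:half])
        halves.append(draws[half:2 * half])

    n = len(halves[0])
    within = mean(variance(h) for h in halves)          # W: mean within-chain variance
    between = n * variance([mean(h) for h in halves])   # B: between-chain variance
    if within == 0:
        # Every half-chain is constant, so the ratio is undefined.
        return float("nan")
    var_plus = (n - 1) / n * within + between / n
    return (var_plus / within) ** 0.5


# Two chains exploring the same values versus two chains stuck in different regions.
agreeing = [[1, 2, 1, 2, 1, 2, 1, 2], [2, 1, 2, 1, 2, 1, 2, 1]]
stuck = [[1, 2, 1, 2, 1, 2, 1, 2], [11, 12, 11, 12, 11, 12, 11, 12]]
print(split_rhat(agreeing), split_rhat(stuck))
```

In practice each draw would be a full vector of cluster labels, reduced with something like `num_clusters` before being passed in. A value well above 1 suggests the chains disagree, while a value near 1 on a single summary does not by itself establish convergence over the space of partitions.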

Updated: 2024-04-24