Assessing and mitigating privacy risks of sparse, noisy genotypes by local alignment to haplotype databases
Genome Research (IF 7), Pub Date: 2023-12-01, DOI: 10.1101/gr.278322.123
Prashant S. Emani, Maya N. Geradi, Gamze Gürsoy, Monica R. Grasty, Andrew Miranker, Mark B. Gerstein

Single nucleotide polymorphisms (SNPs) from omics data create a reidentification risk for individuals and their relatives. Although the ability of thousands of SNPs (especially rare ones) to identify individuals has been repeatedly shown, the availability of small sets of noisy genotypes, from environmental DNA samples or functional genomics data, motivated us to quantify their informativeness. We present a computational tool suite, termed Privacy Leakage by Inference across Genotypic HMM Trajectories (PLIGHT), using population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases. We explore cases in which query individuals are either known to be in the database, or not, and consider several genotype queries, including those from environmental sample swabs from known individuals and from simulated “mosaics” (two-individual composites). Using PLIGHT on a database with ∼5000 haplotypes, we find for common, noise-free SNPs that only ten are sufficient to identify individuals, ∼20 can identify both components in two-individual mosaics, and 20–30 can identify first-order relatives. Using noisy environmental-sample-derived SNPs, PLIGHT identifies individuals in a database using ∼30 SNPs. Even when the individuals are not in the database, local genotype matches allow for some phenotypic information leakage based on coarse-grained SNP imputation. Finally, by quantifying privacy leakage from sparse SNP sets, PLIGHT helps determine the value of selectively sanitizing released SNPs without explicit assumptions about population membership or allele frequency. To make this practical, we provide a sanitization tool to remove the most identifying SNPs from genomic data.
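
To make the idea of "piecewise alignment" to a haplotype database concrete, the following is a minimal sketch of a generic Li-Stephens-style copying HMM decoded with the Viterbi algorithm. It is not PLIGHT's actual implementation: it is simplified to a haploid query, and the function name `viterbi_haplotype_path`, the parameters `switch_prob` and `mismatch_prob`, and the toy panel are all assumptions chosen for illustration. Hidden states are the reference haplotypes, a switch between states stands in for recombination, and an allele mismatch stands in for mutation or genotyping noise.

```python
import math


def viterbi_haplotype_path(panel, query, switch_prob=0.01, mismatch_prob=0.001):
    """Return the most likely reference haplotype index copied at each site.

    panel: list of haplotypes, each a list of 0/1 alleles (hypothetical toy data).
    query: list of observed alleles (0, 1, or None for a missing site).
    switch_prob and mismatch_prob are illustrative values, not fitted ones.
    """
    n_hap = len(panel)
    n_sites = len(query)
    # Li-Stephens-style transitions: switch to a uniformly chosen haplotype
    # with probability switch_prob, otherwise keep copying the same one.
    log_stay = math.log(1.0 - switch_prob + switch_prob / n_hap)
    log_jump = math.log(switch_prob / n_hap)

    def log_emit(h, s):
        if query[s] is None:  # missing site carries no information
            return 0.0
        if panel[h][s] == query[s]:
            return math.log(1.0 - mismatch_prob)
        return math.log(mismatch_prob)

    # Uniform prior over reference haplotypes at the first site.
    score = [-math.log(n_hap) + log_emit(h, 0) for h in range(n_hap)]
    back = []
    for s in range(1, n_sites):
        # Every jump has the same cost, so the best jump source is shared.
        best_prev = max(range(n_hap), key=lambda h: score[h])
        new_score, pointers = [], []
        for h in range(n_hap):
            stay = score[h] + log_stay
            jump = score[best_prev] + log_jump
            if stay >= jump:
                new_score.append(stay + log_emit(h, s))
                pointers.append(h)
            else:
                new_score.append(jump + log_emit(h, s))
                pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_hap), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy example: the query is a mosaic of haplotype 0 and haplotype 2.
panel = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
query = [0, 0, 0, 0, 1, 1, 1, 1]
path = viterbi_haplotype_path(panel, query)
print(path)
```

In this toy case the decoded path copies one reference haplotype for the first half of the sites and another for the second half, which is the kind of local match the abstract refers to. A realistic analysis would need diploid genotypes, site-specific recombination rates, and a much larger panel, none of which this sketch attempts.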

Updated: 2023-12-01