Non-parametric MCMC Gibbs sampler approach and misclassification assessment of estimating haplotype frequencies among related statistical approaches
Abstract
Introduction: Haplotype analysis enables higher-resolution genetic association studies, and haplotypes serve as reference panels for genotype imputation in genome-wide association studies. Haplotypes are estimated from the genotypes of unrelated individuals, but misclassification in haplotype reconstruction directly affects the accuracy of the results.
Methods: This study proposes a novel statistical method, a Gibbs sampler algorithm, to estimate haplotype frequencies and to quantify the influence of misclassification bias in the estimated haplotypes. The performance of the algorithm is evaluated on simulated datasets in which the linkage phase is assumed to be unknown. The simulations used different minor allele frequencies at each single nucleotide polymorphism (SNP) and different levels of linkage disequilibrium between the SNPs.
Results: Compared with previous related statistical approaches, the Gibbs sampler algorithm shows higher accuracy for seven or fewer SNPs, is validated, and handles missing genotypes. Misclassification of estimated haplotypes leads to non-differential misclassification bias in exposure and affects the estimates in haplotype analysis. The observed odds ratio underestimates the association between haplotype and phenotype by 36% to 99%.
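To illustrate why non-differential misclassification of haplotype carrier status pulls the observed odds ratio toward the null, the following minimal sketch applies the standard sensitivity/specificity correction in reverse to a 2x2 table. The counts, the sensitivity of 0.8, and the specificity of 0.9 are hypothetical values chosen for illustration; they are not taken from the study, and the function names are assumptions of this sketch.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio of a 2x2 exposure-by-outcome table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected observed counts when carrier status is classified imperfectly.

    The same sensitivity and specificity are applied to cases and controls,
    which is what makes the misclassification non-differential.
    """
    observed_exposed = sensitivity * exposed + (1 - specificity) * unexposed
    observed_unexposed = (1 - sensitivity) * exposed + specificity * unexposed
    return observed_exposed, observed_unexposed


# Hypothetical true table: carriers vs. non-carriers of a risk haplotype.
true_cases = (200, 300)      # (carriers, non-carriers) among cases
true_controls = (100, 400)   # (carriers, non-carriers) among controls

# Hypothetical accuracy of the haplotype reconstruction.
sensitivity, specificity = 0.8, 0.9

observed_cases = misclassify(*true_cases, sensitivity, specificity)
observed_controls = misclassify(*true_controls, sensitivity, specificity)

true_or = odds_ratio(*true_cases, *true_controls)
observed_or = odds_ratio(*observed_cases, *observed_controls)

# Relative attenuation of the odds ratio caused by misclassification.
attenuation = (true_or - observed_or) / true_or
print(f"true OR = {true_or:.3f}, observed OR = {observed_or:.3f}, "
      f"attenuation = {attenuation:.1%}")
```

In this toy table the observed odds ratio lies between 1 and the true value, which is the direction of bias described above; the size of the attenuation depends entirely on the assumed sensitivity, specificity, and carrier prevalence.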
Conclusion: The Gibbs sampler algorithm provides higher accuracy and robust performance, handles missing genotypes, and provides uncertainty estimates for the haplotype frequencies. Misclassification bias in the estimated haplotypes leads to underestimation of the genetic association by more than forty percent.
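As a rough illustration of how a Gibbs sampler can estimate haplotype frequencies when the linkage phase is unknown, the sketch below handles the simplest case of two biallelic SNPs. It is a generic textbook data-augmentation scheme (a Dirichlet prior on frequencies, random mating assumed), not the authors' algorithm; all function names, the flat prior, and the simulated frequencies are assumptions made for this example. Only double heterozygotes are phase-ambiguous here, so each iteration alternates between sampling their phase given the current frequencies and sampling the frequencies given the resolved haplotype counts.

```python
import random

# Haplotypes over two biallelic SNPs; haplotype (a, b) has index 2*a + b.
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def simulate_genotypes(freqs, n, rng):
    """Draw n unphased genotypes (allele-1 counts per SNP) under random mating."""
    genotypes = []
    for _ in range(n):
        h1, h2 = rng.choices(HAPLOTYPES, weights=freqs, k=2)
        genotypes.append((h1[0] + h2[0], h1[1] + h2[1]))
    return genotypes


def sample_dirichlet(alphas, rng):
    """Sample from a Dirichlet distribution via normalised gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def gibbs_haplotype_freqs(genotypes, n_iter=1500, burn_in=500, prior=1.0, seed=1):
    """Return post-burn-in draws of the four haplotype frequencies."""
    rng = random.Random(seed)
    fixed = [0, 0, 0, 0]   # haplotype counts from phase-unambiguous genotypes
    n_ambiguous = 0        # number of double heterozygotes (1, 1)
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_ambiguous += 1
        else:
            # With at most one heterozygous site the haplotype pair is determined.
            fixed[2 * (g1 >= 1) + (g2 >= 1)] += 1
            fixed[2 * (g1 == 2) + (g2 == 2)] += 1

    freqs = [0.25] * 4
    draws = []
    for it in range(n_iter):
        # Step 1: sample the phase of each double heterozygote given freqs.
        cis = freqs[0] * freqs[3]    # haplotype pair 00 / 11
        trans = freqs[1] * freqs[2]  # haplotype pair 01 / 10
        p_cis = cis / (cis + trans)
        n_cis = sum(rng.random() < p_cis for _ in range(n_ambiguous))
        n_trans = n_ambiguous - n_cis
        counts = [fixed[0] + n_cis, fixed[1] + n_trans,
                  fixed[2] + n_trans, fixed[3] + n_cis]
        # Step 2: sample frequencies given the completed haplotype counts.
        freqs = sample_dirichlet([prior + c for c in counts], rng)
        if it >= burn_in:
            draws.append(freqs)
    return draws


def posterior_mean(draws):
    """Average each haplotype frequency over the retained draws."""
    return [sum(d[k] for d in draws) / len(draws) for k in range(4)]


if __name__ == "__main__":
    rng = random.Random(42)
    true_freqs = [0.4, 0.1, 0.2, 0.3]  # hypothetical simulation truth
    data = simulate_genotypes(true_freqs, 1000, rng)
    print(posterior_mean(gibbs_haplotype_freqs(data)))
```

Because the sampler returns a set of draws rather than a single point estimate, the spread of those draws gives a direct measure of uncertainty in each haplotype frequency. Extending the sketch to more SNPs or to missing genotypes would require enumerating all haplotype pairs compatible with each genotype, which is not shown here.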
Issue | Vol 8 No 3 (2022) | |
Section | Original Article(s) | |
DOI | https://doi.org/10.18502/jbe.v8i3.12304 | |
Keywords | Haplotype Reconstruction; Single Nucleotide Polymorphisms; Markov chain Monte Carlo; Gibbs Sampler Algorithm; Misclassification Bias |
Rights and permissions | |
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. |