Simulation Study of Markov Chain Composite Likelihood and its Application in Recombination Model


DNA sequencing technologies are advancing rapidly, giving researchers access to data that are both high quality and highly detailed. In particular, these technologies can record the allele present at single nucleotide polymorphism (SNP) sites on individual haplotypes. A central goal for SNP data is SNP mapping, which would facilitate advances in genetics, including hierarchical trees that enrich our understanding of human evolutionary history. Although geneticists have detailed SNP data on extant humans, no such data exist for previous generations, so estimation must proceed backward in time, and statistical methods are needed to perform it. The statistical question is: if we observe n current descendant binary sequences of length L, how can we estimate the unknown ancestor distribution while accounting for biological complexities? Recombination, a biological complexity involving an exchange of genetic material between chromosomes, can give descendants haplotypes that do not match ancestral chromosomes. Sun (2011) proposed a Recombination Model that estimates the unknown ancestral distribution under a fixed probability of recombination. Markov chain composite likelihood (MCCL) is used to estimate the population frequency with which the ancestor has a given binary sequence. Under the assumption that both ancestor and descendant sequences follow an order-m Markov chain structure, hierarchical estimation is used to estimate the joint distribution from marginal estimates. Here, we run simulations for this estimator, focusing on the use of MCCL and the selection of fixed quantities. Data are generated by resampling from International HapMap Project data, so that the sample proportions stand in for a true ancestor distribution.
Performance measures include bias and standard error of both joint and marginal estimates, bootstrapped confidence intervals, and the total density correctly assigned to sequences with true non-zero probability. Marginal distribution results show that the method provides estimates with low bias and standard error, but they also show evidence of a directional effect from the use of MCCL: both bias and standard error increase for sites farther from the start of the chain. Joint distribution results show a trade-off between bias and standard error: increasing m decreases the bias but increases the standard error. The joint density sums show that nearly all of the density is assigned either to true non-zero sequences or to sequences that are 85% similar. Finally, to make this methodology accessible, an R package, recombinationMCCL, is currently under development, with a preliminary version available on GitHub.
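
To make the order-m Markov chain assumption concrete, the following is a minimal sketch of how a joint distribution over binary sequences can be assembled from lower-dimensional pieces: the frequency of the first m sites, followed by site-specific conditional frequencies given the previous m sites. It is an illustration only. It uses plain empirical frequencies from a toy sample, does not model recombination, and is not the MCCL estimator of Sun (2011) or the recombinationMCCL package. The function names `fit_order_m` and `joint_prob` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import product


def fit_order_m(seqs, m):
    """Count the pieces of a site-specific order-m Markov chain.

    seqs: equal-length strings over {'0', '1'} (hypothetical toy input).
    Returns counts of the first m sites, of each (site, context) pair,
    and of each (site, context, allele) triple, plus the sample size.
    """
    init = Counter(s[:m] for s in seqs)
    ctx = Counter()
    trans = Counter()
    for s in seqs:
        for t in range(m, len(s)):
            context = s[t - m:t]          # the m sites preceding site t
            ctx[(t, context)] += 1
            trans[(t, context, s[t])] += 1
    return init, ctx, trans, len(seqs)


def joint_prob(x, m, model):
    """Joint probability of sequence x under the fitted order-m chain:
    P(x_1..x_m) * prod_t P(x_t | x_{t-m}..x_{t-1})."""
    init, ctx, trans, n = model
    p = init[x[:m]] / n
    for t in range(m, len(x)):
        context = x[t - m:t]
        c = ctx[(t, context)]
        if c == 0:                        # context never observed at this site
            return 0.0
        p *= trans[(t, context, x[t])] / c
    return p


# Toy sample of n = 4 descendant sequences of length L = 4 (made-up data).
seqs = ["0101", "0101", "0110", "1010"]
model = fit_order_m(seqs, m=1)

# The factorized probabilities over all 2^L sequences form a distribution.
total = sum(joint_prob("".join(bits), 1, model) for bits in product("01", repeat=4))
```

In this sketch, a larger m conditions on longer contexts and therefore needs more data per context. This mirrors, in a much simpler setting, the trade-off between bias and standard error reported for the joint distribution estimates.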



Statistics, Genetics, Genetic Recombination, Markov Chain, Composite Likelihood, Simulation Study