# Chapter 5 Factors Influencing Inference of Clonality in Diploid Populations

Zhian N. Kamvar and Niklaus J. Grünwald

Target Journal: **Molecular Ecology**

## 5.1 Abstract

The index of association is a measure of multilocus linkage disequilibrium, which reflects the deviation of observed genetic variation from expected. In sexual populations, loci are randomly assorting due to recombination, resulting in a near-zero value of the index of association. In clonal populations, recombination is non-existent, meaning that loci are passed from parent to offspring in a non-independent fashion, resulting in a significantly non-zero value of the index of association. We build on previous work by investigating the effect of sample size, mutation rate, and clone correction on the power of the index of association to detect clonal reproduction in simulated data sets generated with microsatellite and genomic markers. Our findings suggest that power decreases with small sample sizes and low allelic diversity, and physical linkage in genomic markers does not affect power if permutation tests are performed with linkage groups. We hope that these novel insights provide useful to the study of molecular pathogens.

## 5.2 Introduction

Population genetic theory is largely based on model populations such as the neutral Wright-Fisher model in which populations are assumed to be infinitely large, with discrete generations, randomly assorting alleles, with no migration and no mutation (Hartl & Clark, 2007; Nielsen & Slatkin, 2013). By using such reductionist models, population geneticists are able to reduce the complexity and enable analyses to ask fundamental questions and test hypotheses about evolutionary processes that could explain population structure and history.

These neutral models, however, cannot be applied to populations whose life history violate the fundamental assumption of random assortment of alleles, such as populations that undergo clonal reproduction (Milgroom, 1996; Orive, 1993). For many clonal populations, the contribution of genetic variation from mutation is greater than that of recombination. While this increases the risk of the accumulation of deleterious mutations, the two-fold cost of sex is drastically reduced, meaning that selectively advantageous combinations of alleles are maintained (Heitman et al., 2012).

Pathogenic microorganisms can reproduce sexually or clonally (Milgroom, 1996; Tibayrenc, 1996). Diseases caused by these organisms are in part managed by the use of antimicrobial compounds that kill these organisms. Detecting recombination in populations of pathogenic microorganisms is therefore important for the implementation of rational management strategies as a prevalence of sexual reproduction could lead to the repeated emergence of novel, resistant genotypes (de Meeûs et al., 2006; Goss et al., 2014; Milgroom, 1996; Nieuwenhuis & James, 2016; Smith et al., 1993).

Several studies have attempted to infer the degree of sex in populations that undergo clonal reproduction (Ali et al., 2016; Balloux et al., 2003; de Meeûs & Balloux, 2004; Nieuwenhuis & James, 2016; Smith et al., 1993). For populations with well-defined sexual and clonal phases occurring at separate times, such as rust fungi, methods like *CloNcaSe* are effective for estimating the rate of sexual reproduction and effective population size (Ali et al., 2016). However, this method cannot be applied to populations where the reproductive cycle is not partitioned into discrete generations.

Simply detecting the presence of clonal reproduction, however can be useful in and of itself (Milgroom, 1996). A method commonly used to assess this is the index of association (\(I_A\)), and its standardized version, \(\bar{r}_d\), which measure multilocus linkage disequilibrium (Agapow & Burt, 2001; Brown et al., 1980; de Meeûs & Balloux, 2004; Haubold et al., 1998; Kamvar et al., 2014b; Smith et al., 1993). The value of \(I_A\), as shown in equation (5.1), is measured as the ratio of observed variance (\(V_O\)) and expected variance (\(V_E\)) in genetic distance between samples (Agapow & Burt, 2001; Smith et al., 1993):

\[\begin{equation} I_A = \frac{V_O}{V_E} - 1 \tag{5.1} \end{equation}\]The expected variance is practically modeled as the sum of the variances over *m* loci: \(V_E = \sum^m{var_j}\) (Agapow & Burt, 2001; Haubold et al., 1998). If the differences between samples are randomly distributed (linkage equilibrium), we can expect the value of \(I_A\) to be zero (Agapow & Burt, 2001; Smith et al., 1993). Under scenarios of non-random mating (e.g. population structure or clonal reproduction), the observed variance would be greater than the expected variance due to a multi-modal distribution of distances, and \(I_A\) would be greater than zero (Agapow & Burt, 2001; Milgroom, 2015; Smith et al., 1993). Agapow & Burt (2001) noted that this metric does not have an upper limit and increases with the number of loci. To correct this, they developed \(\bar{r}_d\) (equation (5.2)), which has a similar structure to a correlation coefficient and ranges from 0 (no linkage) to 1 (complete linkage):

De Meeûx & Balloux (2004) investigated the effect of increasing levels of sexual reproduction on \(\bar{r}_d\) (noted in their publication as \(\bar{r}_D\)). They found that very little (\(\sim\) 1%) sexual reproduction is required to produce a value of \(\bar{r}_d\) close to zero. This indicates that \(\bar{r}_d\) alone might not be well suited as a measure of clonal reproduction. Prugnolle & de Meeûs (2010) tested the effect of sampling design on \(\bar{r}_d\), finding that its value is drastically reduced when clones from multiple populations are sampled, leading to an over-estimation of the level of recombination.

These studies laid the groundwork for understanding the behavior of \(\bar{r}_d\) under different scenarios of non-random mating in diploid organisms, but there were some limitations in available technology. For both studies, the only software available for calculation of \(\bar{r}_d\) for diploid organisms was Multilocus, which could only take one data set at a time (Agapow & Burt, 2001; de Meeûs & Balloux, 2004; Kamvar et al., 2014b; Prugnolle & de Meeûs, 2010). This constrained researchers to only analyze a minimal set of 20 populations per scenario.

A non-zero value of \(I_A\) and \(\bar{r}_d\) does not always indicate a significant departure from the null assumption of unlinked loci. Since the distribution of \(I_A\) and \(\bar{r}_d\) are not known, the safest way to test for significance are random permutation tests. These tests effectively create unlinked populations by shuffling individuals at each locus, independently and re-calculating \(I_A\) and \(\bar{r}_d\) (Agapow & Burt, 2001; Haubold et al., 1998; Smith et al., 1993). An upper one-sided test of significance is then used to see if the observed statistic is greater than the observed distribution.

Standard practice for analyzing microbial populations is to perform this test on both the whole data set (wd) and clone-corrected data (cc), where each multilocus genotype is represented only once per population to avoid signatures of linkage that arise from re-sampling the same individual (Goss et al., 2014; McDonald, 1997; Milgroom, 1996). If the p-value for \(\bar{r}_d\) is significant after clone-correction, then the population is expected to be clonal. While significance testing is available in Multilocus in the form of random permutations, it is computationally expensive, and can only take one data set at a time (Agapow & Burt, 2001; Kamvar et al., 2014b). As a result, power analysis of \(\bar{r}_d\) to detect clonal reproduction with and without clone-correction has not yet been performed (de Meeûs & Balloux, 2004).

In the years since the studies conducted by de Meeûs & Balloux (2004) and Prugnolle & de Meeûs (2010), reduced-representation, high-throughput sequencing methods such as Genotyping-By-Sequencing (GBS) and Restriction site associated DNA sequencing (RAD-seq) have become popular tools for population genetic analysis (Davey & Blaxter, 2010; Davey et al., 2011; Elshire et al., 2011). These methods have the capability to generate thousands of unlinked markers at a fraction of the cost and time necessary to develop high quality microsatellite markers. These marker systems are also prone to missing data and high error rates (Mastretta-Yanes et al., 2014). The index of association was developed for multiple loci at a time when obtaining even 100 unlinked markers posed a significant challenge. With the advent of these technologies, it is unclear how marker choice and genotyping error affect the index of association.

We developed the R package *poppr* for analysis of clonal populations, removing the limitations of data input and computational expense of analyzing the index of association (Kamvar et al., 2014b) and further expanded this to analysis of genome-wide SNP data (Kamvar et al., 2015a; R Core Team, 2016). With these tools we expand on previous studies by asking how sample size, marker choice, clone-correction, and the assumption of homogeneous mutation rates affect our ability to detect clonal reproduction in diploid populations. Our objectives to answer these questions are to (1) re-analyze \(\bar{r}_d\) against increasing rates of sexual reproduction in both microsatellite and SNP data sets, (2) perform a power analysis of \(\bar{r}_d\) to assess sensitivity and specificity, and (3) assess how genotypic and allelic evenness and diversity affects \(\bar{r}_d\). Because studies have observed significantly negative values of \(I_A\) and \(\bar{r}_d\) (p \(\geq\) 0.95), we additionally seek to determine what factors result in negative \(\bar{r}_d\) values. This work provides novel insights into the sensitivity and scope of the index of association for inferring clonality.

## 5.3 Methods

We used simulations to evaluate the behavior of \(\bar{r}_d\) under different population scenarios. Initial sets of simulations were created for different levels of sexual reproduction for each marker type. All simulations were performed with the python package simuPOP version 1.1.7 in python version 3.4 (Peng & Amos, 2008). For each scenario, 100 simulations with 10 replicates were created with a census size of 10,000 diploid individuals with equal mating type proportions evolved over 10,000 generations. We chose a census size of 10,000 individuals to minimize the effect of drift. We chose to run the simulations for 10,000 generations because this was previously shown in de Meeûs & Balloux (2004) to be sufficient in reaching equilibrium for summary statistics.

The simulated populations were first stored in the native simuPOP format and then transferred to feather format using the python and R packages *feather* version 0.3.0 for downstream analyses. Microsatellite simulations were performed on Ubuntu Linux version 14.04; SNP simulations were performed on CentOS Linux version 6.8. During downstream analysis, 10, 25, 50, and 100 individuals were sampled without replacement for each replicate in R version 3.2 with the package *poppr* version 2.3.0 (Kamvar et al., 2015a, 2014b; R Core Team, 2016). Analyses (described below) were performed on both full and clone-corrected data sets. All downstream analyses were run on the OSU CGRB Core Computing Facility.

### 5.3.1 Simulating Microsatellite Loci

Each population was simulated with 21 co-dominant, unlinked loci containing 6 to 10 alleles per locus with frequencies drawn from a uniform distribution and subsequently normalized so that allele frequencies at each locus sum to one. We used 6 to 10 alleles per locus as this is the number of alleles commonly used for population genetic studies to avoid statistical noise. Before mating, mutations occurred at each locus at a rate of 1e-5 mutations/generation with the exception of the first locus, at which the mutation rate was set to 1e-3. The mutation rate of 1e-5 was selected as this was previously used in de Meeûs & Balloux (2004) and is close to the estimated recombination rate of *Saccharomyces cerevisiae* microsatellite loci (Lynch et al., 2008). All mutations were applied in a stepwise manner using the `StepwiseMutator()`

operator in simuPOP.

### 5.3.2 Simulating SNP Loci

Simulations of 10,000 diploid, biallelic loci spread evenly over 10 chromosomal fragments were simulated with a mutation rate of 1e-5 mutations per generation for forward and backward mutations using the `SNPMutator()`

operator and and a recombination rate of 0.01 between adjacent loci using the `Recombinator()`

operator in simuPOP.

### 5.3.3 Mating

Simulations of sexual reproduction were conducted at 10 rates of sexual reproduction over a log scale (0.0, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 0.1, 0.5, 1.0) reflecting the fraction of individuals in generation *t+1* produced via sexual reproduction. One to three offspring could be produced at each mating event. Rates were chosen to investigate the dynamics of \(\bar{r}_d\) at low levels of clonal reproduction. For sexual events, two parents were chosen randomly from the population with the `RandomSelection()`

operator and offspring genotypes were created via the `MendelianGenoTransmitter()`

operator. The clonal fraction was created by randomly sampling individuals from the population and duplicating their genotypes with the `CloneGenoTransmitter()`

operator. If one mating type was lost before 10,000 generations, the simulation would continue to completion with only clonal reproduction.

### 5.3.4 Analysis of Microsatellite Data

The standardized index of association (\(\bar{r}_d\), Agapow & Burt (2001)) was calculated for full and clone-corrected data using the *poppr* function `ia()`

(Kamvar et al., 2014b). Tests for significance were performed by randomly permuting the alleles at each locus independently and then assessing \(\bar{r}_d\). A total of 999 permutations were conducted for each replicate population to evaluate statistical support. Analyses were done for both full and clone-corrected data. The p-values reflect the proportion of observations greater than the observed statistic. Estimates of genotypic diversity were assessed with the *poppr* function `diversity_boot()`

with 999 bootstrap replicates, recording the estimate and variance (Kamvar et al., 2015a). The genotypic diversity statistics we calculated were Shannon’s Index, \(H = -\sum p_i ln p_i\) (Shannon, 1948), Stoddart and Taylor’s Index, \(G = 1/\sum p_i^2\) (Stoddart & Taylor, 1988), and Evenness, \(E_5 = (G - 1)/(e^H - 1)\) (Pielou, 1975) where \(p_i\) is the frequency of the *i*th genotype. We additionally calculated Nei’s expected heterozygosity, also known as gene diversity (Nei, 1978), and mean allelic evenness \(E_{5A} = (1/m) \sum E_{5l}\), where *m* is the number of loci and \(E_{5l}\) is the evenness of the alleles at locus *l* with the *poppr* function `locus_table()`

(Grünwald et al., 2003; Kamvar et al., 2014b).

### 5.3.5 Analysis of SNP Data

The overall value of \(\bar{r}_d\) was calculated for each simulation with the *poppr* function `bitwise.ia()`

(Kamvar et al., 2015a). Significance was first assessed by randomly shuffling genotypes at each locus independently and then, to preserve existing background linkage structure, at each chromosome independently. This was done 999 times for each replicate population. P-values represent the proportion of random samples greater than or equal to the observed statistic.

Because GBS data are associated with high error rates, we additionally wanted to assess the effect of missing data on analysis. To do this, we used scripts written for Kamvar et al. (2015a) to randomly insert missing data via the `pop_NA()`

function (Kamvar et al., 2015b) at rates of 1%, 5%, and 10% each.

### 5.3.6 Power Analysis

Permutation analysis of the index of association is commonly used to calculate statistical support for detecting non-random mating. If the observed value of \(\bar{r}_d\) is greater than 95% of the results from the permutations, then the null hypothesis of linkage equilibrium is rejected. The power or sensitivity of this test can be seen as the fraction of significant results within a set of simulation parameters when the rate of sexual reproduction is < 1.

The Receiver Operating Characteristic (ROC) curve is a statistical method of assessing the balance between sensitivity and specificity of a diagnostic method (Metz, 1978). This is done by simultaneously assessing the true positive fraction of tests to a false positive fraction along a gradient of thresholds of increasing leniency. A simple way of thinking about ROC analysis is to imagine two factories that both make marbles. On average, factory A makes marbles half the size of factory B. Both factories manufacture for the same distributor, which mixes the marbles from both factories. You are tasked with finding the best sieve that can accurately and precisely separate the marbles. To measure this quantity, you order a set of orange marbles from factory A and a set of blue marbles from factory B. To calculate the ROC curve, you would take a range of sieves and stack them such that the one with the largest mesh size is on top and the one with the smallest mesh size is on the bottom and pour both orange and blue marbles through. The fractions of orange and blue marbles that passed through each sieve are the true positive and false positive fractions, respectively. Plot the true positive fraction against the false positive fraction for each sieve respectively, and you can assess how well the sieve method works by examining the shape of the ROC curve.

A useful summary of the ROC curve is to calculate the area under the ROC curve (AURC). Briefly, if a method has perfect explanatory power, the area under the ROC curve will be equal to 1. If a method has no explanatory power, the AURC will be equal to 0.5. We used this method to assess the efficacy of \(\bar{r}_d\) to detect non-random mating.

To calculate the ROC curve, we first define what a positive value is. If we consider the p-value of \(\bar{r}_d\) as a classifier to determine clonality, we can define the false positive fraction as the fraction of observations where \(p \leq \alpha\) for simulations where the rate of sexual reproduction is set to 1.0 (Table 5.1). In contrast, the true positive fraction is the fraction of observations where \(p \leq \alpha\) in simulations where clonality is introduced (e.g. all simulations where the rate of sexual reproduction is less than 1.0).

Reproductive Mode | \(p \leq \alpha\) | \(p > \alpha\) |
---|---|---|

Non-Random Mating | True Positive | False Negative |

Random Mating | False Positive | True Negative |

We constructed each ROC curve by plotting the true positive fraction on the y axis and the false positive fraction on the x axis with values of \(\alpha\) in increments of 0.01 from 0 to 1. Curves were calculated hierarchically by rate of sexual reproduction (< 1) and a unique seed was used to generate the populations. For each hierarchical level, separate curves were calculated for sample size, mutation rate, and clone-correction. The areas under the ROC curves (AURC) were calculated using the `auc()`

function in the R package *flux* version 0.3-0 (Jurasinski et al., 2014).

Because simulations across rates of sexual reproduction were generated from unique seeds, we additionally constructed ROC curves for each seed with respect to clone-correction, sample size, and mutation rate. To test for significant effects of clone-correction, sample size, and mutation rate on the inference of non-random mating on microsatellite data, we then separated the data by rate of sexual reproduction and performed a three way (type III) ANOVA on each rate separately with formula (5.3). For SNP data, we did not have clone correction or mutation rate differences, so we set these variables to 1.

\[\begin{equation} AURC \sim Clone Correction \times Sample Size \times Mutation Rate \tag{5.3} \end{equation}\]Because of a pattern that we saw in the results of \(E_{5A}\) for microsatellite data, we asked whether or not using a value of \(E_{5A} \geq 0.85\) could help detect clonal reproduction even if the p-value was non-significant. To assess this, we conducted an additional ROC analysis to condition p-values on \(E_{5A}\) such that non-significant values that also had \(E_{5A} \geq 0.85\) would be considered significant.

## 5.4 Results

### 5.4.1 Clone Correction Negatively Impacts Clonal Inference

For microsatellite data, the values of \(\bar{r}_d\) for completely clonal and nearly clonal populations ranged from -0.577 to 1 in completely clonal data sets (Fig. 5.1). Genotypic diversity statistics (Fig. 5.2, top six panels) showed a predictable pattern of genotypic diversity increasing with sexual reproduction with diversity being higher on average for clonal populations with uneven mutation rates as compared to even mutation rates. Allelic diversity statistics (Fig. 5.2, bottom four panels) showed few differences in means due to mutation rate, but \(E_{5A}\) showed consistent differences in rate of sexual reproduction and sample size. Both \(E_{5A}\) and \(H_{exp}\) showed higher values for populations where the rate of sexual reproduction is < 0.1%. Not shown in the graph is a bimodal pattern in the \(H_{exp}\) distributions with a rate of sexual reproduction < 0.1%.

The power (p <= 0.01) to detect non-random mating decreases significantly at low levels of sexual reproduction (Fig. 5.3). This is further affected by both mutation rate, sample size, and clone correction. Clone correction, however, appears to affect even mutation rates more strongly than uneven mutation rates. For all scenarios the power to detect non-random mating drops to below 0.25 with rates of sexual reproduction greater than 1%. This trend is only moderately improved when the threshold is raised to p <= 0.05 (data not shown). There is less power to detect non-random mating at 0% and 0.01% sex as compared to 0.05% and 0.1% sex. This appears to be due to the observed negative values of \(\bar{r}_d\) that result in a p-value of 1. At p <= 0.01, we observe a false positive rate of 1.5% in the worst case scenario.

The ROC and AURC analyses showed a steady decrease in the ability to detect non- random mating with increasing levels of sexual reproduction, mainly due to a loss in sensitivity with increasing levels of \(\alpha\) (Fig. 5.4, 5.5). As seen in the power analysis (Fig. 5.3), clone correction consistently appears to lower the power to detect linkage based on the AURC. For all scenarios, the diagnostic power of \(\bar{r}_d\) to detect non-random mating is no better than random at rates of sexual reproduction greater than 10%.

A three-way ANOVA adds support for the hypothesis that clone correction significantly affects the diagnostic power of \(\bar{r}_d\) (Fig. 5.6, Table 5.2). While there are no significant effects at 50% sex, clone correction, mutation rate, and sample size all have a significant effect on \(\bar{r}_d\)’s diagnostic properties. At low levels of sex (< 1%), the interaction of clone correction and mutation rate shows a significant effect, which is not seen in the interaction of clone correction and sample size. Only at sex rates of 0.1% and 0.05% do we see significant effects across all interactions.

sexrate | term | sumsq | df | statistic | p.value |
---|---|---|---|---|---|

0.0000 | (Intercept) | 63.936 | 1 | 10569.674 | 0.000 |

CC | 0.855 | 1 | 141.309 | 0.000 | |

SS | 0.060 | 3 | 3.289 | 0.020 | |

MR | 0.556 | 1 | 91.913 | 0.000 | |

CC X SS | 0.001 | 3 | 0.049 | 0.985 | |

CC X MR | 0.174 | 1 | 28.747 | 0.000 | |

SS X MR | 0.002 | 3 | 0.100 | 0.960 | |

CC X SS X MR | 0.000 | 3 | 0.006 | 0.999 | |

Residuals | 9.582 | 1584 | NA | NA | |

0.0001 | (Intercept) | 70.972 | 1 | 14652.094 | 0.000 |

CC | 0.496 | 1 | 102.400 | 0.000 | |

SS | 0.027 | 3 | 1.886 | 0.130 | |

MR | 0.340 | 1 | 70.257 | 0.000 | |

CC X SS | 0.009 | 3 | 0.596 | 0.617 | |

CC X MR | 0.076 | 1 | 15.698 | 0.000 | |

SS X MR | 0.001 | 3 | 0.077 | 0.972 | |

CC X SS X MR | 0.003 | 3 | 0.241 | 0.868 | |

Residuals | 7.673 | 1584 | NA | NA | |

0.0005 | (Intercept) | 91.939 | 1 | 245382.751 | 0.000 |

CC | 0.072 | 1 | 191.181 | 0.000 | |

SS | 0.089 | 3 | 78.753 | 0.000 | |

MR | 0.050 | 1 | 132.414 | 0.000 | |

CC X SS | 0.047 | 3 | 42.205 | 0.000 | |

CC X MR | 0.027 | 1 | 72.663 | 0.000 | |

SS X MR | 0.035 | 3 | 31.105 | 0.000 | |

CC X SS X MR | 0.018 | 3 | 15.818 | 0.000 | |

Residuals | 0.593 | 1584 | NA | NA | |

0.0010 | (Intercept) | 86.202 | 1 | 157278.033 | 0.000 |

CC | 0.224 | 1 | 408.905 | 0.000 | |

SS | 0.269 | 3 | 163.853 | 0.000 | |

MR | 0.113 | 1 | 206.697 | 0.000 | |

CC X SS | 0.135 | 3 | 82.053 | 0.000 | |

CC X MR | 0.060 | 1 | 109.071 | 0.000 | |

SS X MR | 0.067 | 3 | 40.994 | 0.000 | |

CC X SS X MR | 0.034 | 3 | 20.426 | 0.000 | |

Residuals | 0.868 | 1584 | NA | NA | |

0.0050 | (Intercept) | 52.266 | 1 | 14127.572 | 0.000 |

CC | 0.936 | 1 | 253.111 | 0.000 | |

SS | 2.592 | 3 | 233.531 | 0.000 | |

MR | 0.138 | 1 | 37.393 | 0.000 | |

CC X SS | 0.234 | 3 | 21.126 | 0.000 | |

CC X MR | 0.028 | 1 | 7.675 | 0.006 | |

SS X MR | 0.026 | 3 | 2.385 | 0.068 | |

CC X SS X MR | 0.022 | 3 | 1.974 | 0.116 | |

Residuals | 5.860 | 1584 | NA | NA | |

0.0100 | (Intercept) | 39.376 | 1 | 5724.212 | 0.000 |

CC | 0.686 | 1 | 99.672 | 0.000 | |

SS | 2.455 | 3 | 118.955 | 0.000 | |

MR | 0.068 | 1 | 9.817 | 0.002 | |

CC X SS | 0.173 | 3 | 8.380 | 0.000 | |

CC X MR | 0.003 | 1 | 0.378 | 0.539 | |

SS X MR | 0.051 | 3 | 2.488 | 0.059 | |

CC X SS X MR | 0.056 | 3 | 2.720 | 0.043 | |

Residuals | 10.896 | 1584 | NA | NA | |

0.0500 | (Intercept) | 28.671 | 1 | 1903.303 | 0.000 |

CC | 0.072 | 1 | 4.806 | 0.029 | |

SS | 0.239 | 3 | 5.293 | 0.001 | |

MR | 0.001 | 1 | 0.080 | 0.778 | |

CC X SS | 0.391 | 3 | 8.646 | 0.000 | |

CC X MR | 0.000 | 1 | 0.032 | 0.858 | |

SS X MR | 0.037 | 3 | 0.809 | 0.489 | |

CC X SS X MR | 0.001 | 3 | 0.028 | 0.994 | |

Residuals | 23.861 | 1584 | NA | NA | |

0.1000 | (Intercept) | 26.010 | 1 | 1441.495 | 0.000 |

CC | 0.009 | 1 | 0.524 | 0.469 | |

SS | 0.131 | 3 | 2.419 | 0.065 | |

MR | 0.000 | 1 | 0.001 | 0.979 | |

CC X SS | 0.162 | 3 | 2.984 | 0.030 | |

CC X MR | 0.000 | 1 | 0.013 | 0.908 | |

SS X MR | 0.014 | 3 | 0.256 | 0.857 | |

CC X SS X MR | 0.001 | 3 | 0.015 | 0.998 | |

Residuals | 28.581 | 1584 | NA | NA | |

0.5000 | (Intercept) | 25.669 | 1 | 1489.342 | 0.000 |

CC | 0.001 | 1 | 0.033 | 0.857 | |

SS | 0.042 | 3 | 0.806 | 0.491 | |

MR | 0.002 | 1 | 0.099 | 0.753 | |

CC X SS | 0.001 | 3 | 0.025 | 0.995 | |

CC X MR | 0.000 | 1 | 0.006 | 0.939 | |

SS X MR | 0.004 | 3 | 0.076 | 0.973 | |

CC X SS X MR | 0.001 | 3 | 0.011 | 0.998 | |

Residuals | 27.301 | 1584 | NA | NA |

### 5.4.3 Power To Detect Clonal Reproduction in SNP Data

Because of the computational intensity of the simulations, we were only able to run 24 unique seeds over all rates of sexual reproduction with 10 replicates per seed. Due to an unexpected corner condition, some replicates failed to save, so we randomly sampled five replicates per seed per rate of sexual reproduction, giving us a total of 1,200 total populations for analysis.

Analysis of \(\bar{r}_d\) for SNP data showed a similar pattern to the microsatellite data where clonal and nearly clonal (0.1% sexual reproduction) distributions of \(\bar{r}_d\) were bimodal, but they did not have as many significantly negative values (Fig. 5.10). We observed that that none of the values of \(\bar{r}_d\) for the sexual population reached zero. Increasing levels of missing data appeared to consistently decrease the value of \(\bar{r}_d\) (5.11). This effect was magnified when missing data was treated as a new allele.

We initially randomized all loci independently to test for significant departures from random mating. Our results from that analysis showed *p = 0.001* for all but two clonal data sets (data not shown). After shuffling each chromosome independently, we obtained significance values that better reflected our observations from the microsatellite data. Because of the dearth of significantly negative \(\bar{r}_d\) values in clonal populations, the power of the index to detect clonal reproduction was greater, but at the cost of an increase in false positive rates (Fig. 5.12). An ANOVA analysis of the distributions of ROC over the 24 seeds showed significant differences for 0.5% to 10% sexual reproduction (Fig. 5.13, Tab. 5.4).

sexrate | term | sumsq | df | statistic | p.value |
---|---|---|---|---|---|

0.0000 | (Intercept) | 22.932 | 1 | 17347.604 | 0.000 |

SS | 0.004 | 3 | 1.121 | 0.345 | |

Residuals | 0.122 | 92 | NA | NA | |

0.0001 | (Intercept) | 23.602 | 1 | 29949.701 | 0.000 |

SS | 0.001 | 3 | 0.352 | 0.787 | |

Residuals | 0.072 | 92 | NA | NA | |

0.0005 | (Intercept) | 23.562 | 1 | 28317.512 | 0.000 |

SS | 0.001 | 3 | 0.339 | 0.797 | |

Residuals | 0.077 | 92 | NA | NA | |

0.0010 | (Intercept) | 23.562 | 1 | 28317.512 | 0.000 |

SS | 0.001 | 3 | 0.339 | 0.797 | |

Residuals | 0.077 | 92 | NA | NA | |

0.0050 | (Intercept) | 19.117 | 1 | 6103.399 | 0.000 |

SS | 0.174 | 3 | 18.568 | 0.000 | |

Residuals | 0.288 | 92 | NA | NA | |

0.0100 | (Intercept) | 13.984 | 1 | 1590.890 | 0.000 |

SS | 0.806 | 3 | 30.552 | 0.000 | |

Residuals | 0.809 | 92 | NA | NA | |

0.0500 | (Intercept) | 6.510 | 1 | 232.786 | 0.000 |

SS | 1.340 | 3 | 15.972 | 0.000 | |

Residuals | 2.573 | 92 | NA | NA | |

0.1000 | (Intercept) | 5.920 | 1 | 155.018 | 0.000 |

SS | 0.575 | 3 | 5.023 | 0.003 | |

Residuals | 3.514 | 92 | NA | NA | |

0.5000 | (Intercept) | 6.161 | 1 | 169.859 | 0.000 |

SS | 0.059 | 3 | 0.542 | 0.655 | |

Residuals | 3.337 | 92 | NA | NA |

### 5.4.4 Common Features of Microsatellite and SNP Markers

For both microsatellite and SNP data, as found in de Meeûs & Balloux (2004), large variances were observed with extremely low levels of sexual reproduction (Figs. 5.1, 5.10). The variances and means of \(\bar{r}_d\) decrease with increasing levels of sexual reproduction. For low levels of sexual reproduction (< 0.01%), the distributions appeared bimodal. This feature appears in both the microsatellite and SNP data, but with microsatellite data, the lower part of the distribution extends into the negative values of \(\bar{r}_d\), resulting in non-significant p-values, which occur when low allelic diversity is observed. Additionally, with larger sample sizes, the variances decrease (Fig. 5.1, 5.10) and the power to detect clonal reproduction increases (5.7, 5.12), suggesting that the behavior of the index is affected by sample size. Based on the power analysis, we observe that the critical point to detect clonal reproduction in a population is at a maximum of 1% sexual reproduction (Fig.5.3, 5.7, 5.12).

## 5.5 Discussion

Scientists have used the index of association to provide evidence for clonal reproduction in populations. The index of association measure the ratio of variance between observed and expected genetic distance and is a measure of linkage among markers (equation (5.1) ) (Brown et al., 1980; Smith et al., 1993). Agapow & Burt (2001) created a standardized version, \(\bar{r}_d\), that resembles a correlation coefficient (equation (5.2)). De Meeûx & Balloux (2004) previously showed that very little sexual reproduction is required for \(\bar{r}_d\) to reach values close to zero (e.g. the absence of linkage among markers). We confirmed that the power to detect clonal reproduction is greatly diminished after \(\sim\) 1% sex in a population. We also provide novel insights into how \(\bar{r}_d\) responds to marker type, sample size, mutation rate homogeneity, and clone-correction. We found that the power to detect significant departures of \(\bar{r}_d\) from expected under random mating decreases with small sample sizes, and thus also decreases with clone-correction as sample size is by definition reduced when clone-correcting. More importantly, we found that linkage depends critically on allelic diversity at low levels of sexual reproduction (< 0.1%) as discussed below.

### 5.5.1 Clone Correction Increases False Negatives

While the presence of over-represented genotypes is often an indication of clonal reproduction, it does not explicitly rule out sexual reproduction (Smith et al., 1993; Taylor et al., 1999; Tibayrenc, 1996). This can happen in an ‘epidemic’ population structure where there is a sudden expansion of one genotype in an otherwise panmictic population (Smith et al., 1993). Thus, a common recommendation to researchers is to conduct analyses on both clone-corrected and uncorrected data with the expectation that this will reveal any underlying recombination (McDonald, 1997; Milgroom, 1996; Smith et al., 1993). When analyzing \(\bar{r}_d\), common wisdom says that if the result is significant with and without clone-correction, then there is strong evidence for clonal reproduction. This technique was used by Goss et al. to support the hypothesis that South American populations of the potato late blight pathogen, *Phytophthora infestans* were reproducing clonally (2014). Our findings show that a non-significant value does not necessarily indicate the absence of clonal reproduction since the power to detect clonal reproduction consistently decreases with clone-correction(Fig. 5.5, 5.8). Our results confirm the concerns raised in Tibayrenc (1996) of the increase of false negatives due to clone- correction, further highlighting the fact that a failure to reject the null hypothesis of random mating does not necessarily mean that the population is randomly mating (Milgroom, 1996; Smith et al., 1993; Tibayrenc, 1996). Clone-correction thus should be interpreted carefully in light of the fact that clone-correction reduces sample size and power is significantly reduced with small sample sizes.

### 5.5.2 Mutation Rates and \(\bar{r}_d\)

We assessed the effect of uneven mutation rates across loci, because we had observed that a single, hyper-variable locus with 18 alleles was the driver for most of the diversity observed in a previous study on the population dynamics of *Phytophthora ramorum* in Curry County, OR (Kamvar et al. (2015c), Supplementary Table S4). We hypothesized that the high variability would reduce linkage and thus, reduce the power of \(\bar{r}_d\) to detect clonal reproduction. In contrast, we observed the opposite result; the power to detect clonal reproduction increased with the presence of a single, hyper-variable locus (Fig. 5.3, 5.4). This also appears to make the effects of clone-correction less severe. We believe that this is most likely due to the observed increase in genotypic diversity, which causes an increased sample size with clone-correction (Fig. 5.2).

### 5.5.3 Allelic Diversity Influences the value of \(\bar{r}_d\) in Clonal Populations

A novel feature, not previously observed in de Meeûs & Balloux (2004), was the observation of negative values of \(\bar{r}_d\) (Fig. 5.1). While positive values of \(\bar{r}_d\) are easy to interpret (the observed variance is greater than expected, suggesting non-random mating), negative values are often simply considered to be evidence for random mating. Significantly (p = 1) negative values of \(\bar{r}_d\) have been observed for pathogens that are well studied and known to be clonal such as the sudden oak death pathogen, *Phytophthora ramorum* (Kamvar et al. (2015c), Supplementary Table S3) and the wheat stripe rust pathogen *Puccinia graminis* f.sp. *tritici* (Les Szabo, pers. comm.). A common pattern in all of these populations is that they all had values of \(E_{5A}\) > 0.9 and, on average, fewer than 4 alleles per locus. This pattern of \(E_{5A}\) is reflected in our simulations (Fig. 5.2). When we set a boundary condition for our power analysis based on evenness \(\{p \leq \alpha \mid E_{5A} \geq 0.85\}\), we found that this reduced the number of false negatives without significantly contributing to the number of false positives (Fig. 5.7, 5.8). Levels of \(E_{5A} > 0.85\) also had a low number of allelic states (data not shown), which indicates that the bimodal distribution of \(\bar{r}_d\) for clonal populations is driven by allelic diversity. Thus, including allelic evenness can serve as a diagnostic parameter when interpreting linkage at low levels of sexual reproduction under conditions of low allelic diversity.

### 5.5.4 Linkage in SNP Data Increases \(\bar{r}_d\) Unilaterally

The results we obtained from SNP data revealed almost no values of \(\bar{r}_d\) at or below zero (Fig. 5.10). Testing for significance by shuffling data at each locus independently rendered all the values of \(\bar{r}_d\) significant, regardless of rate of sexual reproduction. Only when we shuffled linkage groups were we able to obtain the same level of significance as compared to the microsatellite data. This indicates that it is possible to detect clonal reproduction with the same power as microsatellite markers even with physically linked markers. Great care, however must be taken to ensure that linkage blocks are shuffled together or that SNP markers are not linked before assessing significance of a value of \(\bar{r}_d\). In contrast to the microsatellite data, sample size did not significantly affect power with high levels of clonal reproduction (Fig 5.13, Tab: 5.4).

### 5.5.5 Limitations of This Study

The scope of simulation studies are limited to the parameters used to generate the simulations. We do not know, for example, what effect selection, population subdivision, or polyploidy would have on inference of \(\bar{r}_d\). While we were able to directly compare our populations with even and uneven mutation rates, because they both represent the exact same populations, we do not know if the effect we see is due to a higher average mutation rate, thus further study is needed to detect the effect of mutation rate in clonal populations. In contrast to de Meeûs & Balloux (2004), whereas their simulations had 99 alleles per locus, our simulations were designed with a small number of alleles per locus to better reflect microsatellite markers that would be selected for typical population genetic analyses. Because of this, there was a higher chance of homoplasy in our simulations. This, however was most likely beneficial to our analysis as our results show that low allelic diversity can lead to negative \(\bar{r}_d\) values, which we have previously observed in data from purely clonal organisms.

### 5.5.6 Conclusions

The index of association is a valuable measure for lending support to the hypothesis of clonal reproduction in natural populations. We confirm that \(\bar{r}_d\) can provide evidence for clonal reproduction when there is < 1% sexual reproduction within the population (de Meeûs & Balloux, 2004), but conclude that this inference is dependent on sample size and allelic diversity. Thus scientists should exercise caution when interpreting results from a clone-corrected data set. We found that this index is robust enough to detect the same level of clonal reproduction for SNP data, but results from permutation analyses may be misleading if linkage is present among markers.

### References

Hartl, D. L., & Clark, A. G. (2007). *Principles of population genetics*. Sunderland, MA, USA: Sinauer Associates.

Nielsen, R., & Slatkin, M. (2013). *An introduction to population genetics: Theory and applications*. Sinauer Associates, Incorporated. Retrieved from http://books.google.com/books?id=Iy08kgEACAAJ

Milgroom, M. G. (1996). Recombination and the multilocus structure of fungal populations. *Annual Review of Phytopathology*, *34*(1), 457–477.

Orive, M. E. (1993). Effective population size in organisms with complex life-histories. *Theoretical Population Biology*, *44*(3), 316–340. https://doi.org/10.1006/tpbi.1993.1031

Heitman, J., Sun, S., & James, T. Y. (2012). Evolution of fungal sexual reproduction. *Mycologia*, *105*(1), 1–27. https://doi.org/10.3852/12-253

Tibayrenc, M. (1996). Towards a unified evolutionary genetics of microorganisms. *Annual Review of Microbiology*, *50*(1), 401–429. https://doi.org/10.1146/annurev.micro.50.1.401

de Meeûs, T., Lehmann, L., & Balloux, F. (2006). Molecular epidemiology of clonal diploids: A quick overview and a short DIY (do it yourself) notice. *Infection, Genetics and Evolution*, *6*(2), 163–170. https://doi.org/10.1016/j.meegid.2005.02.004

Goss, E. M., Tabima, J. F., Cooke, D. E., Restrepo, S., Fry, W. E., Forbes, G. A., Fieland, V. J., Cardenas, M., & Grünwald, N. J. (2014). The Irish potato famine pathogen *Phytophthora infestans* originated in central Mexico rather than the Andes. *Proceedings of the National Academy of Sciences*, *111*(24), 8791–8796.

Nieuwenhuis, B. P. S., & James, T. Y. (2016). The frequency of sex in fungi. *Philosophical Transactions of the Royal Society B: Biological Sciences*, *371*(1706), 20150540. https://doi.org/10.1098/rstb.2015.0540

Smith, J. M., Smith, N. H., O’Rourke, M., & Spratt, B. G. (1993). How clonal are bacteria? *Proceedings of the National Academy of Sciences*, *90*(10), 4384–4388. https://doi.org/10.1073/pnas.90.10.4384

Ali, S., Soubeyrand, S., Gladieux, P., Giraud, T., Leconte, M., Gautier, A., Mboup, M., Chen, W., Vallavieille-Pope, C., & Enjalbert, J. (2016). Cloncase: Estimation of sex frequency and effective population size by clonemate resampling in partially clonal organisms. *Molecular Ecology Resources*. https://doi.org/10.1111/1755-0998.12511

Balloux, F., Lehmann, L., & de Meeûs, T. (2003). The population genetics of clonal and partially clonal diploids. *Genetics*, *164*(4), 1635–1644.

de Meeûs, T., & Balloux, F. (2004). Clonal reproduction and linkage disequilibrium in diploids: A simulation study. *Infection, Genetics and Evolution*, *4*(4), 345–351. https://doi.org/10.1016/j.meegid.2004.05.002

Agapow, P.-M., & Burt, A. (2001). Indices of multilocus linkage disequilibrium. *Molecular Ecology Notes*, *1*(1-2), 101–102. https://doi.org/10.1046/j.1471-8278.2000.00014.x

Brown, A., Feldman, M., & Nevo, E. (1980). Multilocus structure of natural populations of *Hordeum spontaneum*. *Genetics*, *96*(2), 523–536. Retrieved from http://www.genetics.org/content/96/2/523.abstract

Haubold, B., Travisano, M., Rainey, P. B., & Hudson, R. R. (1998). Detecting linkage disequilibrium in bacterial populations. *Genetics*, *150*(4), 1341–8.

Kamvar, Z. N., Tabima, J. F., & Grünwald, N. J. (2014b). Poppr : an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. *PeerJ*, *2*, e281. https://doi.org/10.7717/peerj.281

Milgroom, M. G. (2015). *Population biology of plant pathogens: Genetics, ecology, and evolution*. 3340 Pilot Knob Road, St. Paul, MN 55121, USA: APS Press.

Prugnolle, F., & de Meeûs, T. (2010). Apparent high recombination rates in clonal parasitic organisms due to inappropriate sampling design. *Heredity*, *104*(2), 135–140. https://doi.org/10.1038/hdy.2009.128

McDonald, B. A. (1997). The population genetics of fungi: tools and techniques. *Phytopathology*, *87*(4), 448–453.

Davey, J. W., & Blaxter, M. L. (2010). RADSeq: next-generation population genetics. *Briefings in Functional Genomics*, *9*(5-6), 416–423. https://doi.org/10.1093/bfgp/elq031

Davey, J. W., Hohenlohe, P. A., Etter, P. D., Boone, J. Q., Catchen, J. M., & Blaxter, M. L. (2011). Genome-wide genetic marker discovery and genotyping using next-generation sequencing. *Nature Reviews Genetics*, *12*(7), 499–510. https://doi.org/10.1038/nrg3012

Elshire, R. J., Glaubitz, J. C., Sun, Q., Poland, J. A., Kawamoto, K., Buckler, E. S., & Mitchell, S. E. (2011). A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. *PLoS ONE*, *6*(5), e19379. https://doi.org/10.1371/journal.pone.0019379

Mastretta-Yanes, A., Arrigo, N., Alvarez, N., Jorgensen, T. H., Piñero, D., & Emerson, B. C. (2014). Restriction site-associated DNA sequencing, genotyping error estimation and *de novo* assembly optimization for population genetic inference. *Molecular Ecology Resources*, *15*(1), 28–41. https://doi.org/10.1111/1755-0998.12291

Kamvar, Z. N., Brooks, J. C., & Grünwald, N. J. (2015a). Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. *Frontiers in Genetics*, *6*. https://doi.org/10.3389/fgene.2015.00208

R Core Team. (2016). *R: A language and environment for statistical computing*. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/

Peng, B., & Amos, C. I. (2008). Forward-time simulations of non-random mating populations using simuPOP. *Bioinformatics*, *24*(11), 1408–1409. https://doi.org/10.1093/bioinformatics/btn179

Lynch, M., Sung, W., Morris, K., Coffey, N., Landry, C. R., Dopman, E. B., Dickinson, W. J., Okamoto, K., Kulkarni, S., Hartl, D. L., & Thomas, W. K. (2008). A genome-wide view of the spectrum of spontaneous mutations in yeast. *Proceedings of the National Academy of Sciences*, *105*(27), 9272–9277. https://doi.org/10.1073/pnas.0803466105

Shannon, C. E. (1948). A mathematical theory of communication. *ACM SIGMOBILE Mobile Computing and Communications Review*, *5*(1), 3–55.

Stoddart, J. A., & Taylor, J. F. (1988). Genotypic diversity: Estimation and prediction in samples. *Genetics*, *118*(4), 705–11.

Pielou, E. (1975). *Ecological diversity*. New York: Wiley & Sons.

Nei, M. (1978). Estimation of average heterozygosity and genetic distance from a small number of individuals. *Genetics*, *89*(3), 583–590. Retrieved from http://www.genetics.org/content/89/3/583.abstract

Grünwald, N. J., Goodwin, S. B., Milgroom, M. G., & Fry, W. E. (2003). Analysis of genotypic diversity data for populations of microorganisms. *Phytopathology*, *93*(6), 738–746. https://doi.org/10.1094/phyto.2003.93.6.738

Kamvar, Z. N., Brooks, J. C., & Grünwald, N. J. (2015b, May). Supplementary Material for Frontiers Plant Genetics and Genomics “Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality”. ZENODO. https://doi.org/10.5281/zenodo.17424

Metz, C. E. (1978). Basic principles of ROC analysis. In *Seminars in nuclear medicine* (Vol. 8, pp. 283–298). Elsevier. https://doi.org/10.1016/S0001-2998(78)80014-2

Jurasinski, G., Koebsch, F., Guenther, A., & Beetz, S. (2014). *flux: Flux rate calculation from dynamic closed chamber measurements*. Retrieved from https://CRAN.R-project.org/package=flux

Taylor, J. W., Geiser, D. M., Burt, A., & Koufopanou, V. (1999). The evolutionary biology and population genetics underlying fungal strain typing. *Clinical Microbiology Reviews*, *12*(1), 126–146.

Kamvar, Z. N., Larsen, M. M., Kanaskie, A. M., Hansen, E. M., & Grünwald, N. J. (2015c). Spatial and temporal analysis of populations of the sudden oak death pathogen in oregon forests. *Phytopathology*, *105*(7), 982–989. https://doi.org/10.1094/phyto-12-14-0350-fi