Chapter 3 Novel R Tools For Analysis of Genome-Wide Population Genetic Data With Emphasis on Clonality

Zhian N. Kamvar, Jonah C. Brooks, and Niklaus J. Grünwald

Journal: Frontiers in Genetics

EPFL Innovation Park, Building I, CH – 1015 Lausanne Switzerland

Published 2015-06-10. Issue: 6, DOI: 10.3389/fgene.2015.00208

3.1 Abstract

To gain a detailed understanding of how plant microbes evolve and adapt to hosts, pesticides, and other factors, knowledge of the population dynamics and evolutionary history of populations is crucial. Plant pathogen populations are often clonal or partially clonal which requires different analytical tools. With the advent of high throughput sequencing technologies, obtaining genome-wide population genetic data has become easier than ever before. We previously contributed the R package poppr specifically addressing issues with analysis of clonal populations. In this paper we provide several significant extensions to poppr with a focus on large, genome-wide SNP data. Specifically, we provide several new functionalities including the new function mlg.filter to define clone boundaries allowing for inspection and definition of what is a clonal lineage, minimum spanning networks with reticulation, a sliding-window analysis of the index of association, modular bootstrapping of any genetic distance, and analyses across any level of hierarchies.

3.2 Introduction

To paraphrase Dobzhansky, nothing in the field of plant-microbe interactions makes sense except in the light of population genetics (Dobzhansky, 1973). Genetic forces such as selection and drift act on alleles in a population. Thus, a true understanding of how plant pathogens emerge, evolve and adapt to crops, fungicides, or other factors, can only be elucidated in the context of population level phenomena given the demographic history of populations (Grünwald & Goss, 2011; McDonald & Linde, 2002 b; Milgroom et al., 1989). The field of population genetics, in the era of whole genome resequencing, provides unprecedented power to describe the evolutionary history and population processes that drive coevolution between pathogens and hosts. This powerful field thus critically enables effective deployment of R genes, design of pathogen informed plant resistance breeding programs, and implementation of fungicide rotations that minimize emergence of resistance.

Most computational tools for population genetics are based on concepts developed for sexual model organisms. Populations that reproduce clonally or are polyploid are thus difficult to characterize using classical population genetic tools because theoretical assumptions underlying the theory are violated. Yet, many plant pathogen populations are at least partially clonal if not completely clonal (Anderson & Kohn, 1995; Milgroom, 1996). Thus, development of tools for analysis of clonal or polyploid populations is needed.

Genotyping by sequencing and whole genome resequencing provide the unprecedented ability to identify thousands of single nucleotide polymorphisms (SNPs) in populations (Davey et al., 2011; Elshire et al., 2011; Luikart et al., 2003). With traditional marker data (e.g., SSR, AFLP) a clone was typically defined as a unique multilocus genotype (MLG) (Cooke et al., 2012; Falush et al., 2003; Goss et al., 2009; Grünwald & Hoheisel, 2006; Taylor & Fisher, 2003). Availability of large SNP data sets provides new challenges for data analysis. These data are based on reduced representation libraries and high throughput sequencing with moderate sequencing depth which invariably results in substantial missing data, error in SNP calling due to sequencing error, lack of read depth or other sources of spurious allele calls (Mastretta-Yanes et al., 2014). It is thus not clear what a clone is in large SNP data sets and novel tools are required for definition of clone boundaries.

The research community using the R statistical and computing language (R Core Team, 2015) has developed a plethora of new resources for population genetic analysis. R is particularly appealing because all code is open source and functions can be evaluated and modified by any user. Recently, we introduced the R package poppr specifically developed for analysis of clonal populations (Kamvar et al., 2014 b). Poppr previously introduced several novel features including the ability to conduct a hierarchical analysis across unlimited hierarchies, test for linkage association, graph minimum spanning networks or provide bootstrap support for Bruvo’s distance in resulting trees. Poppr has been rapidly adopted and applied to a range of studies including for example horizontal transmission in leukemia of clams (Metzger et al., 2015), study of the vector-mediated parent-to-offspring transmission in an avian malaria-like parasite (Chakarov et al., 2015), and characterization of the emergence of the invasive forest pathogen Hymenoscyphus pseudoalbidus (Gross et al., 2014). It has also been used to implement real-time, online R based tools for visualizing relationships among unknown MLGs in reference databases
(http://phytophthora-id.org/) (Grünwald et al., 2011).

Here, we introduce poppr 2.0, which provides a major update to poppr (Kamvar et al., 2014 b) including novel tools for analysis of clonal populations specifically addressing large SNP data. Significant novel tools include functions for calculating clone boundaries and collapsing individuals into clonal groups based on a user-specified genetic distance threshold, sliding window analyses, genotype accumulation curves, reticulations in minimum spanning networks, and bootstrapping for any genetic distance.

3.3 Implementations and Examples

3.3.1 Clonal identification

As highlighted in previous work, clone correction is an important component of population genetic analysis of organisms that are known to reproduce asexually (Grünwald et al., 2003; Kamvar et al., 2014 b; Milgroom, 1996). This method is a partial correction for bias that affects metrics that rely on allele frequencies assuming panmixia and was initially designed for data with only a handful of markers. With the advent of large-scale sequencing and reduced- representation libraries, it has become easier to sequence tens of thousands of markers from hundreds of individuals (Davey & Blaxter, 2010; Davey et al., 2011; Elshire et al., 2011). With this larger number of markers, the genetic resolution is much greater, but the chance of genotyping error is also greatly increased and missing data is frequent (Mastretta-Yanes et al., 2014). Taking this fact and occasional somatic mutations into account, it would be impossible to separate true clones from independent individuals by just comparing what MLGs are different. We introduce a new method for collapsing unique multilocus genotypes determined by naive string comparison into multilocus lineages utilizing any genetic distance given three different clustering algorithms: farthest neighbor, nearest neighbor, and UPGMA (average neighbor) (Sokal, 1958).

These clustering algorithms act on a distance matrix that is either provided by the user or generated via a function that will calculate a distance from genetic data such as bruvo.dist, which in particular applies to any level of ploidy (Bruvo et al., 2004). All algorithms have been implemented in C and utilize the OpenMP framework for optional parallel processing (Dagum & Menon, 1998). Default is the conservative farthest neighbor algorithm (Fig. 3.1A), which will only cluster samples together if all samples in the cluster are at a distance less than the given threshold. By contrast, the nearest neighbor algorithm will have a chaining effect that will cluster samples akin to adding links on a chain where a sample can be included in a cluster if all of the samples have at least one connection below a given threshold (Fig. 3.1C). The UPGMA, or average neighbor clustering algorithm is the one most familiar to biologists as it is often used to generate ultra-metric trees based on genetic distance (Fig. 3.1B). This algorithm will cluster by creating a representative sample per cluster and joining clusters if these representative samples are closer than the given threshold.

Diagrammatic representation of the three clustering algorithms
implemented in `mlg.filter`. (**A-C**) Represent different clustering
algorithms on the same imaginary network with a threshold of 0.451. Edge
weights are represented in arbitrary units noted by the line thickness
and numerical values next to the lines. All outer angles are 90 degrees,
so the un-labeled edge weights can be obtained with simply geometry.
Colored circles represent clusters of genotypes. (**A**) Farthest
neighbor clustering does not cluster nodes B and C because nodes A and C
are more than a distance of 0.451 apart. (**B**) UPGMA (average
neighbor) clustering clusters nodes A, B, and C together because the
average distance between them and C is < 0.451. (**C**) Nearest
neighbor clustering clusters all nodes together because the minimum
distance between them is always < 0.451.

Figure 3.1: Diagrammatic representation of the three clustering algorithms implemented in mlg.filter. (A-C) Represent different clustering algorithms on the same imaginary network with a threshold of 0.451. Edge weights are represented in arbitrary units noted by the line thickness and numerical values next to the lines. All outer angles are 90 degrees, so the un-labeled edge weights can be obtained with simply geometry. Colored circles represent clusters of genotypes. (A) Farthest neighbor clustering does not cluster nodes B and C because nodes A and C are more than a distance of 0.451 apart. (B) UPGMA (average neighbor) clustering clusters nodes A, B, and C together because the average distance between them and C is < 0.451. (C) Nearest neighbor clustering clusters all nodes together because the minimum distance between them is always < 0.451.

We utilize data from the microbe Phytophthora infestans to show how the mlg.filter function collapses multilocus genotypes with Bruvo’s distance assuming a genome addition model (Bruvo et al., 2004). P. infestans is the causal agent of potato late blight originating from Mexico that spread to Europe in the mid 19th century (Goss et al., 2014; Yoshida et al., 2013). P. infestans reproduces both clonally and sexually. The clonal lineages of P. infestans have been formally defined into 18 separate clonal lineages using a combination of various molecular methods including AFLP and microsatellite markers (Lees et al., 2006; Li et al., 2013). For these data, we used mlg.filter to detect all of the distance thresholds at which 18 multilocus lineages would be resolved. We used these thresholds to define multilocus lineages and create contingency tables and dendrograms to determine how well the multilocus lineages were detected.

For the P. infestans population, the three algorithms were able to detect 18 multilocus lineages at different distance thresholds (Fig. 3.2). Contingency tables between the described multilocus genotypes and the genotypes defined by distance show that most of the 18 lineages were resolved, except for US-8, which is polytomic (Table 3.1).

Graphical representation of three different clustering algorithms
collapsing multilocus genotypes for 12 SSR loci from *Phytophthora
infestans* representing 18 clonal lineages. The horizontal axis is
Bruvo's genetic distance assuming the genome addition model. The
vertical axis represents the number of multilocus lineages observed.
Each point shows the threshold at which one would observe a given number
of multilocus genotypes. The horizontal black line represents 18
multilocus genotypes and vertical dashed lines mark the thresholds used
to collapse the multilocus genotypes into 18 multilocus lineages.

Figure 3.2: Graphical representation of three different clustering algorithms collapsing multilocus genotypes for 12 SSR loci from Phytophthora infestans representing 18 clonal lineages. The horizontal axis is Bruvo’s genetic distance assuming the genome addition model. The vertical axis represents the number of multilocus lineages observed. Each point shows the threshold at which one would observe a given number of multilocus genotypes. The horizontal black line represents 18 multilocus genotypes and vertical dashed lines mark the thresholds used to collapse the multilocus genotypes into 18 multilocus lineages.

Table 3.1: Contingency table comparing multilocus lineages (MLL) defined in Li et al. (2013) and Lees et al. (2006) (rows) to MLLs inferred from Bruvo’s genetic distance (columns) at a threshold of 0.07 with the average neighbor algorithm (Bruvo et al., 2004; Sokal, 1958). Values in the table represent the number of times any given inferred MLL matches with a previously defined MLL. For example, in our original data set, there were three genotypes previously defined as the US-24 MLL. All three genotypes were also determined to cluster into a single MLL by filtering. In contrast, US-8 was determined to cluster into three different MLLs by filtering.
	3	4	5	6	8	10	12	15	16	17	18	20	21	22	24	25	27	28
B	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	1	.	.
C	.	.	.	.	.	.	.	.	.	.	.	.	.	.	1	.	.	.
D.1	.	.	.	.	.	.	.	.	.	.	.	.	.	1	.	.	.	.
D.2	.	.	.	.	.	.	.	.	.	.	.	.	.	1	.	.	.	.
EU-13	.	.	.	.	.	.	.	.	1	.	.	.	.	.	.	.	.	.
EU-4	.	.	.	.	.	.	.	.	.	1	.	.	.	.	.	.	.	.
EU-5	.	.	.	.	.	.	.	.	.	.	2	.	.	.	.	.	.	.
EU-8	.	.	.	.	.	.	1	.	.	.	.	.	.	.	.	.	.	.
US-11	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	2
US-12	.	1	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
US-14	.	.	.	.	.	1	.	.	.	.	.	.	.	.	.	.	.	.
US-17	.	.	.	.	.	.	.	.	.	.	.	1	.	.	.	.	.	.
US-20	2	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
US-21	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	2	.
US-22	.	.	.	.	.	.	.	.	.	.	.	.	2	.	.	.	.	.
US-23	.	.	.	.	.	.	.	3	.	.	.	.	.	.	.	.	.	.
US-24	.	.	.	.	3	.	.	.	.	.	.	.	.	.	.	.	.	.
US-8	.	.	1	1	.	2	.	.	.	.	.	.	.	.	.	.	.	.

We utilized simulated data to evaluate the effect of sequencing error and missing data on MLG calling. We constructed the data using the glSim function in adegenet (Jombart & Ahmed, 2011) to obtain a SNP data set for demonstration. Two diploid data sets were created, each with 10k SNPs (25% structured into two groups) and 200 samples with 10 ancestral populations of even sizes. Clones were created in one data set by marking each sample with a unique identifier and then randomly sampling with replacement. It is well documented that reduced- representation sequencing can introduce several erroneous calls and missing data (Mastretta-Yanes et al., 2014). To reflect this, we mutated SNPs at a rate of 10% and inserted an average of 10% missing data for each sample after clones were created, ensuring that no two sequences were alike. The number of mutations and missing data per sample were determined by sampling from a Poisson distribution with $\lambda = 1000$. After pooling, 20% of the data set was randomly sampled for analysis. Genetic distance was obtained with the function bitwise.dist, which calculates the fraction of different sites between samples equivalent to Provesti’s distance, counting missing data as equivalent in comparison (Prevosti et al., 1975).

All three filtering algorithms were run with a threshold of 1, returning a numeric vector of length $n - 1$ where each element represented a threshold at which two samples/clusters would join. Since each data set would have varying distances between samples, the clonal boundary threshold was defined as the midpoint of the largest gap between two thresholds that collapsed less than 50% of the data.

Out of the 100 simulations run, we found that across all methods, detection of duplicated samples had $\sim$ 98% true positive fraction and $\sim$ 0.8% false positive fraction indicating that this method is robust to simulated populations (supplementary materials²).

3.3.2 Minimum Spanning Networks with Reticulation

In its original iteration, poppr introduced minimum spanning networks that were based on the igraph function minimum.spanning.tree (Csardi & Nepusz, 2006). This algorithm produces a minimum spanning tree with no reticulations where nodes represent individual MLGs. In other minimum spanning network programs, reticulation is obtained by calculating the minimum spanning tree several times and returning the set of all edges included in the trees. Due to the way igraph has implemented Prim’s algorithm, it is not possible to utilize this strategy, thus we implemented an internal C function to walk the space of minimum spanning trees based on genetic distance to connect groups of nodes with edges of equal weight.

To demonstrate the utility of minimum spanning networks with reticulation, we used two clonal data sets: the H3N2 flu virus data from the adegenet package using years of each epidemic as the population factor, and Phytophthora ramorum data from Nurseries and Oregon forests (Jombart et al., 2010; Kamvar et al., 2014 a). Minimum spanning networks were created with and without reticulation using the poppr functions diss.dist and bruvo.msn for the H3N2 and P. ramorum data, respectively (Bruvo et al., 2004; Kamvar et al., 2014 b). To detect mlg clusters, the infoMAP community detection algorithm was applied with 10,000 trials as implemented in the R package igraph version 0.7.1 utilizing genetic distance as edge weights and number of samples in each MLG as vertex weights (Csardi & Nepusz, 2006; Rosvall & Bergstrom, 2008).

To evaluate the results, we compared the number, size, and entropy ($H$) of the resulting communities as we expect a highly clonal organism with low genetic diversity to result in a few, large communities. We also created contingency tables of the community assignments with the defined populations and used those to calculate entropy using Shannon’s index with the function diversity from the R package vegan version 2.2-1 (Oksanen et al., 2015; Shannon, 1948). A low entropy indicates presence of a few large communities whereas high entropy indicates presence of many small communities.

The infoMAP algorithm revealed 63 communities with a maximum community size of 77 and $H = 3.56$ for the reticulate network of the H3N2 data and 117 communities with a maximum community size of 26 and $H = 4.65$ for the minimum spanning tree. The entropy across years was greatly decreased for all populations with the reticulate network compared to the minimum spanning tree (Fig. 3.3). Note that the reticulated network (Fig. 3.3B) showed patterns corresponding with those resulting from a discriminant analysis of principal components (Fig. 3.3D) (Jombart et al., 2010).

(**A-B**) Minimum spanning networks of the hemagglutinin (HA) segment of
H3N2 viral DNA from the *adegenet* package representing flu epidemics
from 2001 to 2006 without reticulation (**A**) and with reticulation
(**B**) [@Jombart_2008; @jombart2010discriminant]. Each node represents
a unique multilocus genotype, colors represent epidemic year, and edge
color represents absolute genetic distance. (**C**) Shannon entropy
values for population assignments compared with communities determined
by the infoMAP algorithm on (**A**) and (**B**). (**D**) Graphic
reproduced from @jombart2010discriminant showing that the 2006 epidemic
does not cluster neatly with the other years via Discriminant Analysis
of Principal Components. Horizontal axis represents the first
discriminant component. Vertical axis represents the second discriminant
component.

Figure 3.3: (A-B) Minimum spanning networks of the hemagglutinin (HA) segment of H3N2 viral DNA from the adegenet package representing flu epidemics from 2001 to 2006 without reticulation (A) and with reticulation (B) (Jombart, 2008; Jombart et al., 2010). Each node represents a unique multilocus genotype, colors represent epidemic year, and edge color represents absolute genetic distance. (C) Shannon entropy values for population assignments compared with communities determined by the infoMAP algorithm on (A) and (B). (D) Graphic reproduced from Jombart et al. (2010) showing that the 2006 epidemic does not cluster neatly with the other years via Discriminant Analysis of Principal Components. Horizontal axis represents the first discriminant component. Vertical axis represents the second discriminant component.

Graph walking of the reticulated minimum spanning network of P. ramorum by the infoMAP algorithm revealed 16 communities with a maximum community size of 13 and $H = 2.60$. The un-reticulated minimum spanning tree revealed 20 communities with a maximum community size of 7 and $H = 2.96$. In the ability to predict Hunter Creek as belonging to a single community, the reticulated network was successful whereas the minimum spanning tree separated one genotype from that community. The entropy for the reticulated network was lower for all populations except for the coast population (supplementary materials³).

3.3.3 Bootstrapping

Assessing population differentiation through methods such as $G_{st}$, AMOVA, and Mantel tests relies on comparing samples within and across populations (Excoffier et al., 1992; Mantel, 1967; Nei, 1973). Confidence in distance metrics is related to the confidence in the markers to accurately represent the diversity of the data. Especially true with microsatellite markers, a single hyper-diverse locus can make a population appear to have more diversity based on genetic distance. Using a bootstrapping procedure of randomly sampling loci with replacement when calculating a distance matrix provides support for clades in hierarchical clustering.

Data in genind and genpop objects are represented as matrices with individuals in rows and alleles in columns (Jombart, 2008). This gives the advantage of being able to use R’s matrix algebra capabilities to efficiently calculate genetic distance. Unfortunately, this also means that bootstrapping is a non- trivial task as all alleles at a single locus need to be sampled together. To remedy this, we have created an internal S4 class called “bootgen”, which extends the internal “gen” class from adegenet. This class can be created from any genind, genclone, or genpop object, and allows loci to be sampled with replacement. To further facilitate bootstrapping, a function called aboot, which stands for “any boot”, is introduced that will bootstrap any genclone, genind, or genpop object with any genetic distance that can be calculated from it.

To demonstrate calculating a dendrogram with bootstrap support, we used the poppr function aboot on population allelic frequencies derived from the data set microbov in the adegenet package with 1000 bootstrap replicates (Jombart, 2008; Laloë et al., 2007). The resulting dendrogram shows bootstrap support values $>50\%$ (Fig. 3.4) and used the following code:

library("poppr");
data("microbov", package = "adegenet");
strata(microbov) <- data.frame(other(microbov));
setPop(microbov) <- ~coun/spe/breed;
bov_pop <- genind2genpop(microbov);

set.seed(20150428);
pop_tree <- aboot(bov_pop, sample = 1000, cutoff = 50);

$UPGMA dendrogram generated from Nei's genetic distance on 15 breeds of *Bos taurus* (BT) or *Bos indicus* (BI) from Africa (AF) or France (FR). These data are from @laloe2007consensus. Node labels represent bootstrap support $>50\%$ out of 1,000 bootstrap replicates.$

Figure 3.4: UPGMA dendrogram generated from Nei’s genetic distance on 15 breeds of Bos taurus (BT) or Bos indicus (BI) from Africa (AF) or France (FR). These data are from Laloë et al. (2007). Node labels represent bootstrap support $>50\%$ out of 1,000 bootstrap replicates.

3.3.4 Genotype Accumulation Curve

Analysis of population genetics of clonal organisms often borrows from ecological methods such as analysis of diversity within populations (Arnaud-Hanod et al., 2007; Grünwald et al., 2003; Milgroom, 1996). When choosing markers for analysis, it is important to make sure that the observed diversity in your sample will not appreciably increase if an additional marker is added (Arnaud-Hanod et al., 2007). This concept is analogous to a species accumulation curve, obtained by rarefaction. The genotype accumulation curve in poppr is implemented in the function genotype_curve. The curve is constructed by randomly sampling $x$ loci and counting the number of observed MLGs. This repeated $r$ times for 1 locus up to $n-1$ loci, creating $n-1$ distributions of observed MLGs.

The following code example demonstrates the genotype accumulation curve for data from Everhart & Scherm (2015) showing that these data reach a small plateau and have a greatly decreased variance with 12 markers, indicating that there are enough markers such that adding more markers to the analysis will not create very many new genotypes (Fig. 3.5).

library("poppr");
library("ggplot2");
data("monpop", package = "poppr");

set.seed(20150428);
genotype_curve(monpop, sample = 1000);
p <- last_plot() + theme_bw();   # get the last plot
p + geom_smooth(aes(group = 1)); # plot with a trendline

Genotype accumulation curve for 694 isolates of the peach brown rot
pathogen, *Monilinia fructicola* genotyped over 13 loci from
@everhart2014fine. The horizontal axis represents the number of loci
randomly sampled without replacement up to *n - 1* loci, the vertical
axis shows the number of multilocus genotypes observed, up to 262, the
number of unique multilocus genotypes in the data set. The red dashed
line represents 90% of the total observed multilocus genotypes. A
trendline (blue) has been added using the *ggplot2* function
`stat_smooth`.

Figure 3.5: Genotype accumulation curve for 694 isolates of the peach brown rot pathogen, Monilinia fructicola genotyped over 13 loci from Everhart & Scherm (2015). The horizontal axis represents the number of loci randomly sampled without replacement up to n - 1 loci, the vertical axis shows the number of multilocus genotypes observed, up to 262, the number of unique multilocus genotypes in the data set. The red dashed line represents 90% of the total observed multilocus genotypes. A trendline (blue) has been added using the ggplot2 function stat_smooth.

3.3.5 Index of association

The index of association ($I_A$) is a measure of multilocus linkage disequilibrium that is most often used to detect clonal reproduction within organisms that have the ability to reproduce via sexual or asexual processes (Brown et al., 1980; Milgroom, 1996; Smith et al., 1993). It was standardized in 2001 as $\bar{r}_d$ by Agapow & Burt (2001) to address the issue of scaling with increasing number of loci. This metric is typically applied to traditional dominant and co-dominant markers such as AFLPs, SNPs, or microsatellite markers. With the advent of high throughput sequencing, SNP data is now available in a genome-wide context and in very large matrices including thousands of SNPs. For this reason, we devised two approaches using the index of association for large numbers of markers typical for population genomic studies. Both functions utilize adegenet’s “genlight” object class, which efficiently stores 8 binary alleles in a single byte (Jombart & Ahmed, 2011). As calculation of the $\bar{r}_d$ requires distance matrices of absolute number of differences, we utilize a function that calculates these distances directly from the compressed data called bitwise.dist.

The first approach is a sliding window analysis implemented in the function win.ia. It utilizes the position of markers in the genome to calculate $\bar{r}_d$ among any number of SNPs found within a user-specified windowed region. It is important that this calculation utilize $\bar{r}_d$ as the number of loci will be different within each window (Agapow & Burt, 2001). This approach would be suited for a quick calculation of linkage disequilibrium across the genome that can detect potential hotspots of LD that could be investigated further with more computationally intensive methods assuming that the number of samples << the number of loci.

As it would necessarily focus on loci within a short section of the genome that may or may not be recombining, a sliding window approach would not be good for utilizing $\bar{r}_d$ as a test for clonal reproduction. A remedy for this is implemented in the function samp.ia, which will randomly sample $m$ loci, calculate $\bar{r}_d$, and repeat $r$ times, thus creating a distribution of expected values of $\bar{r}_d$.

To demonstrate the sliding window and random sampling of $\bar{r}_d$ with respect to clonal populations, we simulated two populations containing 1,100 neutral SNPs for 100 diploid individuals under the same initial seed. One population had individuals randomly sampled with replacement, representing the clonal population. After sampling, both populations had 5% random error and 1% missing data independently propagated across all samples. On average, we obtained a higher value of $\bar{r}_d$ for the clonal population compared to the sexual population for both methods (Fig. 3.6).

$**(A)** Sliding window analysis of the standardized index of association ($\bar{r}_d$) across a simulated $1.1 \times 10^4$ nt chromosome containing 1,100 variants among 100 individuals. Each window analyzed variants within 500nt chunks. The black line refers to the clonal and the blue line to the sexual populations. **(B)** boxplots showing 100 random samples of 50 variants to calculate a distribution of $\bar{r}_d$ for the clonal (red) and sexual (blue) populations. Each box is centered around the mean, with whiskers extending out to 1.5 times the interquartile range. The median is indicated by the center line. **(A)** and **(B)** are plotted on the same y-axis.$

Figure 3.6: (A) Sliding window analysis of the standardized index of association ($\bar{r}_d$) across a simulated $1.1 \times 10^4$ nt chromosome containing 1,100 variants among 100 individuals. Each window analyzed variants within 500nt chunks. The black line refers to the clonal and the blue line to the sexual populations. (B) boxplots showing 100 random samples of 50 variants to calculate a distribution of $\bar{r}_d$ for the clonal (red) and sexual (blue) populations. Each box is centered around the mean, with whiskers extending out to 1.5 times the interquartile range. The median is indicated by the center line. (A) and (B) are plotted on the same y-axis.

3.3.6 Data format updates: population strata and hierarchies

Assessments of population structure through methods such as hierarchical $F_{st}$ (Goudet, 2005) and AMOVA (Michalakis & Excoffier, 1996) require hierarchical sampling of populations across space or time (Everhart & Scherm, 2015; Grünwald & Hoheisel, 2006; Linde et al., 2002). With clonal organisms, basic practice has been to clone-censor data to avoid downward bias in diversity due to duplicated genotypes that may or may not represent different samples (Milgroom, 1996). This correction should be performed with respect to a population hierarchy to accurately reflect the biology of the organism. Traditional data structures for population genetic data in most analysis tools allow for only one level of hierarchical definition. The investigator thus had to provide the data set for analysis at each hierarchical level.

To facilitate handling hierarchical and mutlilocus genotypic metadata, poppr version 1.1 introduced a new S4 data object called “genclone”, extending adegenet’s “genind” object (Kamvar and Grünwald, unpublished). The genclone object formalized the definitions of multilocus genotypes and population hierarchies by adding two slots called “mlg” and “hierarchy” that carried a numeric vector and a data frame, respectively. These new slots allow for increased efficiency and ease of use by allowing these metadata to travel with the genetic data. The hierarchy slot in particular contains a data frame where each column represents a separate hierarchical level. This is then used to set the population factor of the data by supplying a hierarchical formula containing one or more column names of the data frame in the hierarchy slot.

The functionality represented by the hierarchy slot has now been migrated from the poppr to the adegenet package version 2.0 to allow hierarchical analysis in adegenet, poppr, and other dependent packages. The prior poppr hierarchy slot and methods have now been renamed strata in adegenet. A short example of the utility of these methods can be seen in the code segment under Bootstrapping, above. This migration provides end users with a broader ability to analyze data hierarchically in R across packages.

3.4 Availability

As of this writing, the poppr R package version 2.0 containing all of the features described here is located at https://github.com/grunwaldlab/poppr/tree/2.0-rc. It is necessary to install adegenet 2.0 before installing poppr. It can be found at https://github.com/thibautjombart/adegenet. Both of these can be installed via the R package devtools (Wickham & Chang, 2015). More information and example code can be found in the supplementary materials⁴.

3.4.1 Requirements

R version 3.0 or better
A C compiler. For windows, it can be obtained via Rtools (http://cran.r-project.org/bin/windows/Rtools/). On OSX, it can be obtained via Xcode. For parallel support, gcc version 4.6 or better is needed.

3.4.2 Installation

From within R, poppr can be installed via:

install.packages("devtools")
library("devtools")
install_github("thibautjombart/adegenet")
install_github("grunwaldlab/poppr@2.0-rc")

Several population genetics packages in R are currently going through a major upgrade following the 2015 R hackathon on population genetics (https://github.com/NESCent/r-popgen-hackathon) and have not yet been updated in CRAN. We will upload poppr 2.0 to CRAN once all other reverse dependent packages have been updated.

3.5 Discussion

Given low cost and high throughput of current sequencing technologies we are entering a new era of population genetics where large SNP data sets with thousands of markers are becoming available for large populations in a genome- wide context. This data provides new possibilities and challenges for population genetic analyses. We provide novel tools that enable analysis of this data in R with a particular emphasis on clonal organisms.

Particularly useful is the implementation of $\bar{r}_d$ in a genomic context (Agapow & Burt, 2001). Random sampling of loci across the genome can give an expected distribution of $\bar{r}_d$, which is expected to have a mean of zero for panmictic populations. This metric is not affected by the number of loci sampled, is model free, and has the ability to detect population structure. $\bar{r}_d$ is also implemented for sliding window analyses that are useful to detect candidate regions of linkage disequilibrium for further analysis.

Clustering multilocus genotypes into multilocus lineages based on genetic distances is a non-trivial task given large SNP data sets. Moreover, this has not previously been implemented for genomic data for clonal populations. Clonal assignment has previously been available in the programs and for classical markers (Arnaud-Hanod et al., 2007; Meirmans & Van Tienderen, 2004). Our method with mlg.filter builds upon this idea and allows the user to choose between three different approaches for clustering MLGs. The choice of clustering algorithm has an impact on the data (Fig. 3.1, 3.2), where for example a genetic distance cutoff of 0.1 would be the difference between 14 multilocus lineages (MLLs) and 17 MLLs for nearest neighbor and UPGMA clustering, respectively (Fig. 3.2). The option to choose the clustering algorithm gives the user the ability to choose what is biologically relevant to their populations. While there is not one optimal procedure for defining boundaries in clonal lineages, our tool provides a means of exploring the potential MLG or MLL boundary space.

Minimum spanning networks are a useful tool to analyze the relationships between individuals in a population, because it reduces the complexity of a distance matrix to the connections that are strongest. By default, these networks are drawn without reticulations, but for clonal organisms where many of the connections between samples are equivalent, the minimum spanning network appears as a chain and reduces the information that can be communicated. This is problematic because the ability to detect population structure with one instance of a minimum spanning network is limited. Adding reticulation into the minimum spanning network thus presents all equivalent connections and allows population structure to be more readily detectable. As shown in Fig. 3.3, population structure is apparent both visually and by graph community detection algorithms such as the infoMAP algorithm (Rosvall & Bergstrom, 2008). Additionally, the current implementation in poppr has been successfully used in analyses such as reconstruction of the P. ramorum epidemic in Oregon forests (Kamvar et al., 2014 a, 2015 c).

Poppr 2.0 is open source and available on GitHub. Members of the community are invited to contribute by raising issues or pull requests on our repository at https://github.com/grunwaldlab/poppr/issues.

3.6 Acknowledgements

We thank Ignazio Carbone for discussions on the index of association; David Cooke, Sanmohan Baby, and Jens Hansen for beta testing; and Thibaut Jombart for allowing us to incorporate the strata slot and related methods in adegenet. We also thank all the members of the 2015 R hackathon on population genetics in Durham, NC for their advice and input (https://github.com/NESCent/r-popgen-hackathon). This work was supported in part by US Department of Agriculture (USDA) Agricultural Research Service Grant 5358-22000-039-00D, USDA National Institute of Food and Agriculture Grant 2011-68004-30154, USDA APHIS, the USDA-ARS Floriculture Nursery Initiative, and the USDA-Forest Service Forest Health Monitoring Program (to NJG).

References

Dobzhansky, T. (1973). Nothing in biology makes sense except in the light of evolution. The American Biology Teacher, 75(2), 87–91.

Grünwald, N. J., & Goss, E. M. (2011). Evolution and population genetics of exotic and re-emerging pathogens: Novel tools and approaches. Annual Review of Phytopathology, 49(1), 249–267. https://doi.org/10.1146/annurev-phyto-072910-095246

McDonald, B. A., & Linde, C. (2002b). The population genetics of plant pathogens and breeding strategies for durable resistance. Euphytica, 124(2), 163–180. https://doi.org/10.1023/A:1015678432355

Milgroom, M. G., Levin, S. A., & Fry, W. E. (1989). Population genetics theory and fungicide resistance. Plant Disease Epidemiology, 2, 340–367.

Anderson, J. B., & Kohn, L. M. (1995). Clonality in soilborne, plant-pathogenic fungi. Annual Review of Phytopathology, 33(1), 369–391. https://doi.org/10.1146/annurev.py.33.090195.002101

Milgroom, M. G. (1996). Recombination and the multilocus structure of fungal populations. Annual Review of Phytopathology, 34(1), 457–477.

Davey, J. W., Hohenlohe, P. A., Etter, P. D., Boone, J. Q., Catchen, J. M., & Blaxter, M. L. (2011). Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nature Reviews Genetics, 12(7), 499–510. https://doi.org/10.1038/nrg3012

Elshire, R. J., Glaubitz, J. C., Sun, Q., Poland, J. A., Kawamoto, K., Buckler, E. S., & Mitchell, S. E. (2011). A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE, 6(5), e19379. https://doi.org/10.1371/journal.pone.0019379

Luikart, G., England, P. R., Tallmon, D., Jordan, S., & Taberlet, P. (2003). The power and promise of population genomics: From genotyping to genome typing. Nature Reviews Genetics, 4(12), 981–994.

Cooke, D. E. L., Cano, L. M., Raffaele, S., Bain, R. A., Cooke, L. R., Etherington, G. J., Deahl, K. L., Farrer, R. A., Gilroy, E. M., Goss, E. M., Grünwald, N. J., Hein, I., MacLean, D., McNicol, J. W., Randall, E., Oliva, R. F., Pel, M. A., Shaw, D. S., Squires, J. N., Taylor, M. C., Vleeshouwers, V. G. A. A., Birch, P. R. J., Lees, A. K., & Kamoun, S. (2012). Genome analyses of an aggressive and invasive lineage of the irish potato famine pathogen. PLoS Pathogens, 8(10), e1002940. https://doi.org/10.1371/journal.ppat.1002940

Falush, D., Stephens, M., & Pritchard, J. K. (2003). Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics, 164(4), 1567–1587. Retrieved from http://www.genetics.org/content/164/4/1567.abstract

Goss, E. M., Larsen, M., Chastagner, G. A., Givens, D. R., & Grünwald, N. J. (2009). Population genetic analysis infers migration pathways of Phytophthora ramorum in US nurseries. PLoS Pathogens, 5(9), e1000583. https://doi.org/10.1371/journal.ppat.1000583

Grünwald, N. J., & Hoheisel, G.-A. (2006). Hierarchical analysis of diversity, selfing, and genetic differentiation in populations of the oomycete Aphanomyces euteiches. Phytopathology, 96(10), 1134–1141.

Taylor, J. W., & Fisher, M. C. (2003). Fungal multilocus sequence typing – it’s not just for bacteria. Current Opinion in Microbiology, 6(4), 351–356.

Mastretta-Yanes, A., Arrigo, N., Alvarez, N., Jorgensen, T. H., Piñero, D., & Emerson, B. C. (2014). Restriction site-associated DNA sequencing, genotyping error estimation and de novo assembly optimization for population genetic inference. Molecular Ecology Resources, 15(1), 28–41. https://doi.org/10.1111/1755-0998.12291

R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/

Kamvar, Z. N., Tabima, J. F., & Grünwald, N. J. (2014b). Poppr : an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ, 2, e281. https://doi.org/10.7717/peerj.281

Metzger, M. J., Reinisch, C., Sherry, J., & Goff, S. P. (2015). Horizontal transmission of clonal cancer cells causes leukemia in soft-shell clams. Cell, 161(2), 255–263.

Chakarov, N., Linke, B., Boerner, M., Goesmann, A., Krüger, O., & Hoffman, J. I. (2015). Apparent vector-mediated parent-to-offspring transmission in an avian malaria-like parasite. Molecular Ecology, 24(6), 1355–1363.

Gross, A., Hosoya, T., & Queloz, V. (2014). Population structure of the invasive forest pathogen Hymenoscyphus pseudoalbidus. Molecular Ecology, 23(12), 2943–2960.

Grünwald, N. J., Martin, F. N., Larsen, M. M., Sullivan, C. M., Press, C. M., Coffey, M. D., Hansen, E. M., & Parke, J. L. (2011). Phytophthora-ID.org: a sequence-based Phytophthora identification tool. Plant Disease, 95(3), 337–342.

Grünwald, N. J., Goodwin, S. B., Milgroom, M. G., & Fry, W. E. (2003). Analysis of genotypic diversity data for populations of microorganisms. Phytopathology, 93(6), 738–746. https://doi.org/10.1094/phyto.2003.93.6.738

Davey, J. W., & Blaxter, M. L. (2010). RADSeq: next-generation population genetics. Briefings in Functional Genomics, 9(5-6), 416–423. https://doi.org/10.1093/bfgp/elq031

Sokal, R. R. (1958). A statistical method for evaluating systematic relationships. Univ Kans Sci Bull, 38, 1409–1438.

Bruvo, R., Michiels, N. K., D’Souza, T. G., & Schulenburg, H. (2004). A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level. Molecular Ecology, 13(7), 2101–2106.

Dagum, L., & Menon, R. (1998). OpenMP: An industry standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1), 46–55.

Goss, E. M., Tabima, J. F., Cooke, D. E., Restrepo, S., Fry, W. E., Forbes, G. A., Fieland, V. J., Cardenas, M., & Grünwald, N. J. (2014). The Irish potato famine pathogen Phytophthora infestans originated in central Mexico rather than the Andes. Proceedings of the National Academy of Sciences, 111(24), 8791–8796.

Yoshida, K., Schuenemann, V. J., Cano, L. M., Pais, M., Mishra, B., Sharma, R., Lanz, C., Martin, F. N., Kamoun, S., Krause, J., Thines, M., Weigel, D., & Burbano, H. A. (2013). The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine. eLife, 2. https://doi.org/10.7554/elife.00731

Lees, A., Wattier, R., Shaw, D., Sullivan, L., Williams, N., & Cooke, D. (2006). Novel microsatellite markers for the analysis of Phytophthora infestans populations. Plant Pathology, 55(3), 311–319.

Li, Y., Cooke, D. E., Jacobsen, E., & Lee, T. van der. (2013). Efficient multiplex simple sequence repeat genotyping of the oomycete plant pathogen Phytophthora infestans. Journal of Microbiological Methods, 92(3), 316–322.

Jombart, T., & Ahmed, I. (2011). Adegenet 1.3-1: New tools for the analysis of genome-wide SNP data. Bioinformatics, 27(21), 3070–3071.

Prevosti, A., Ocaña, J., & Alonso, G. (1975). Distances between populations of Drosophila subobscura, based on chromosome arrangement frequencies. Theoretical and Applied Genetics, 45(6), 231–241.

Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695. Retrieved from http://igraph.org

Jombart, T., Devillard, S., & Balloux, F. (2010). Discriminant analysis of principal components: A new method for the analysis of genetically structured populations. BMC Genetics, 11(1), 94. https://doi.org/10.1186/1471-2156-11-94

Kamvar, Z. N., Larsen, M. M., Kanaskie, A. M., Hansen, E. M., & Grünwald, N. J. (2014a, December). Sudden_Oak_Death_in_Oregon_Forests: Spatial and temporal population dynamics of the sudden oak death epidemic in Oregon Forests. ZENODO. https://doi.org/10.5281/zenodo.13007

Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4), 1118–1123.

Oksanen, J., Blanchet, F. G., Kindt, R., Legendre, P., Minchin, P. R., O’Hara, R. B., Simpson, G. L., Solymos, P., Stevens, M. H. H., & Wagner, H. (2015). Vegan: Community ecology package. Retrieved from http://CRAN.R-project.org/package=vegan

Shannon, C. E. (1948). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), 3–55.

Jombart, T. (2008). Adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24(11), 1403–1405. https://doi.org/10.1093/bioinformatics/btn129

Excoffier, L., Smouse, P. E., & Quattro, J. M. (1992). Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics, 131(2), 479–91.

Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer Research, 27(2 Part 1), 209–220.

Nei, M. (1973). Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences, 70(12), 3321–3323.

Laloë, D., Jombart, T., Dufour, A.-B., & Moazami-Goudarzi, K. (2007). Consensus genetic structuring and typological value of markers using multiple co-inertia analysis. Genetics Selection Evolution, 39(5), 1–23.

Arnaud-Hanod, S., Duarte, C. M., Alberto, F., & Serrão, E. A. (2007). Standardizing methods to address clonality in population studies. Molecular Ecology, 16(24), 5115–5139. https://doi.org/10.1111/j.1365-294X.2007.03535.x

Everhart, S. E., & Scherm, H. (2015). Fine-scale genetic structure of Monilinia fructicola during brown rot epidemics within individual peach tree canopies. Phytopathology, 105(4), 542–549. https://doi.org/10.1094/phyto-03-14-0088-r

Brown, A., Feldman, M., & Nevo, E. (1980). Multilocus structure of natural populations of Hordeum spontaneum. Genetics, 96(2), 523–536. Retrieved from http://www.genetics.org/content/96/2/523.abstract

Smith, J. M., Smith, N. H., O’Rourke, M., & Spratt, B. G. (1993). How clonal are bacteria? Proceedings of the National Academy of Sciences, 90(10), 4384–4388. https://doi.org/10.1073/pnas.90.10.4384

Agapow, P.-M., & Burt, A. (2001). Indices of multilocus linkage disequilibrium. Molecular Ecology Notes, 1(1-2), 101–102. https://doi.org/10.1046/j.1471-8278.2000.00014.x

Goudet, J. (2005). Hierfstat, a package for R to compute and test hierarchical F-statistics. Molecular Ecology Notes, 5(1), 184–186.

Michalakis, Y., & Excoffier, L. (1996). A generic estimation of population subdivision using distances between alleles with special reference for microsatellite loci. Genetics, 142(3), 1061–1064.

Linde, C., Zhan, J., & McDonald, B. (2002). Population structure of Mycosphaerella graminicola: From lesions to continents. Phytopathology, 92(9), 946–955.

Wickham, H., & Chang, W. (2015). Devtools: Tools to make developing R packages easier. Retrieved from http://CRAN.R-project.org/package=devtools

Meirmans, P. G., & Van Tienderen, P. H. (2004). GENOTYPE and GENODIVE: Two programs for the analysis of genetic diversity of asexual organisms. Molecular Ecology Notes, 4(4), 792–794. https://doi.org/10.1111/j.1471-8286.2004.00770.x

Kamvar, Z. N., Larsen, M. M., Kanaskie, A. M., Hansen, E. M., & Grünwald, N. J. (2015c). Spatial and temporal analysis of populations of the sudden oak death pathogen in oregon forests. Phytopathology, 105(7), 982–989. https://doi.org/10.1094/phyto-12-14-0350-fi

Supplementary data available at https://github.com/grunwaldlab/supplementary-poppr-2.0; DOI: 10.5281/zenodo.17424 ↩
Supplementary data available at https://github.com/grunwaldlab/supplementary-poppr-2.0; DOI: 10.5281/zenodo.17424 ↩
Supplementary data available at https://github.com/grunwaldlab/supplementary-poppr-2.0; DOI: 10.5281/zenodo.17424 ↩