Chapter 6 Conclusions

Clearly, a discipline is defined by the questions asked, not the tools used.

Milgroom & Fry (1997, p. 4)

The above quote by Michael Milgroom and William Fry was in reference to the use of molecular markers in molecular biology as compared to population genetics. While it is undisputed that questions shape a field of inquiry, the notion that tools are not influential in disciplines is misleading. Tools are necessary for providing answers to the questions proposed; they are the vehicle whereby we apply our scientific theory to the unknown world.

A tool, in this sense, is any instrument, physical or analytical, that is used to collect, measure, manipulate, represent, or analyze data (Gigerenzer, 1991). This definition encompasses things like hammers, hand lenses, mass spectrometers, maps, axioms, algorithms, gel electrophoresis, and equations. All of these tools are used within a theoretical framework (e.g. gravity, refraction); any observations or results produced with a particular tool are ultimately tied to the theory employed by the scientist using it (and would thus invoke different interpretations under a different theoretical framework) (Kuhn, 1996). If all the assumptions of the theoretical framework are met, the tool will produce an observation or result that will help the scientist describe the natural phenomena accurately in terms of a testable theory.

These tools, however, should not simply be seen as a means to an end for answering questions. Many tools will produce answers whether or not those answers are correct. A simple example of this concept was demonstrated by Anscombe (1973), who showed the need for graphical visualization in statistical analysis. Reproduced in Fig. 6.1 are four data sets, each shown with a fitted trend line. Using linear regression, all four data sets produce nearly identical results (slope, intercept, variance, and correlation). Upon visual inspection, their differences are striking.

Figure 6.1: A reproduction of Anscombe’s quartet (Anscombe, 1973) demonstrating different situations in which linear regression would give the same answer.

If we imagine each data set as a separate population and linear regression as our tool, it wouldn’t matter what our question was, because there would be no hope of detecting any differentiation between these populations with the tool (e.g. molecular marker) chosen.
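The numerical side of this example can be checked directly. The following is a minimal sketch in Python, used here only for illustration; the data values are those published by Anscombe (1973), while the names `QUARTET`, `fit_line`, `max_abs_residual`, and `SUMMARY` are hypothetical and do not come from any software described in this dissertation. It fits an ordinary least squares line to each data set and also reports the largest absolute residual, one simple quantity that separates the data sets even though the fitted lines do not.

```python
# Anscombe's quartet (Anscombe, 1973): four data sets of 11 (x, y) points each.
# Data sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r), where r is Pearson's correlation.
    Illustrative helper; not part of any published package.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


def max_abs_residual(x, y, slope, intercept):
    """Largest absolute vertical distance between a point and the fitted line."""
    return max(abs(yi - (intercept + slope * xi)) for xi, yi in zip(x, y))


# Fit every data set with the same tool.
SUMMARY = {name: fit_line(x, y) for name, (x, y) in QUARTET.items()}

for name, (slope, intercept, r) in SUMMARY.items():
    x, y = QUARTET[name]
    worst = max_abs_residual(x, y, slope, intercept)
    print(f"{name:>3}: slope={slope:.3f} intercept={intercept:.3f} "
          f"r={r:.3f} max|residual|={worst:.2f}")
```

The slope, intercept, and correlation agree across the four data sets to about two decimal places, so the regression summary alone gives no hint of the structure that the plots in Fig. 6.1 make obvious.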

Science moves forward by asking questions about natural phenomena and then investigating the results of these questions further, narrowing the scope of the succeeding questions to pin down a detailed mechanism that can explain the phenomena observed. The tools provide the observations and results that set the context for future questions (Searls, 2010). I have presented such tools in the form of scientific software, which can enable reproducible research in the context of computational science.

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

– Jonathan Buckheit and David Donoho paraphrasing Jon Claerbout (1995)

Buckheit & Donoho (1995) did not invent the concept of reproducible research in scientific computing, but they did show the computational research community that such a concept was possible. The ultimate goal of reproducible research lies in the term itself: to ensure that research is produced in a manner such that future researchers can faithfully reproduce and verify the results (Goecks et al., 2010). Lack of reproducibility has been (and still is) a problem due to various factors, including software and point-and-click user interfaces (Ioannidis et al., 2008; Ziemann et al., 2016).

The development of scientific software does not stand apart from science itself; rather, it serves as an implementation of scientific theory (Baxter et al., 2006; Ouzounis & Valencia, 2003; Partridge et al., 1984; Searls, 2010). Scientific software has been used to implement new theory (Agapow & Burt, 2001; Ali et al., 2016; Arnaud-Haond & Belkhir, 2006; Felsenstein, 1989) and to make existing theory and methods more accessible (Bailleul et al., 2016; Goudet, 1995; Kamvar et al., 2014b; Winter, 2012). Ultimately, the publication and maintenance of scientific software exist to standardize the protocols by which we manipulate and analyze our data, allowing researchers to produce answers to their questions more efficiently and reliably. When scientists produce their research in an open and reproducible manner, the benefits include not only higher citation counts, but also increased potential for collaboration and data reuse (McKiernan et al., 2016; Wilson et al., 2014, 2016). It is thus imperative that the software available not simply exist as a black box from which a researcher can produce an answer; it must be useful, well tested, extensible, and open. In our experience, writing educational user manuals for software and responding to user feedback also help adoption and reproducibility.

We have provided in this dissertation descriptions of the research software poppr. Chapters 2 and 3 expounded on the functionalities and benefits of poppr in terms of reproducible research, ease of use, and speed. It is important to acknowledge that this work would not have progressed without open and collaborative software development. Chapter 3 was only possible because of the contributions I was able to make to adegenet (the package poppr depends on) during and after the NESCent Population Genetics in R Hackathon in 2014 (Jombart & Ahmed, 2011; Kamvar et al., 2015a; Paradis et al., 2016). An important aspect of software development is ‘eating your own dogfood’, that is, using your own software the way it was intended (Kamvar et al., 2016). We have used poppr in conjunction with the larger ecosystem of R packages to give evidence of two introductions of Phytophthora ramorum in Curry County, OR in a reproducible manner (Kamvar et al., 2014a, 2015c) and to show that the power of \(\bar{r}_d\) is affected by both sample size and allelic diversity in diploids. Beyond what has been published in scientific journals, we continue to develop poppr so that it holds to standards of scientific software development such as the use of rigorous testing, continuous integration, version control, community support, and sensible documentation (Baxter et al., 2006; Prlić & Procter, 2012; Wilson et al., 2016).

References

Milgroom, M., & Fry, W. (1997). Contributions of population genetics to plant disease epidemiology and management. In Advances in botanical research (pp. 1–30). Elsevier BV. https://doi.org/10.1016/s0065-2296(08)60069-5

Gigerenzer, G. (1991). From tools to theories: A heuristic of discovery in cognitive psychology. Psychological Review, 98(2), 254–267. https://doi.org/10.1037/0033-295x.98.2.254

Kuhn, T. S. (1996). The structure of scientific revolutions. University of Chicago Press. https://doi.org/10.7208/chicago/9780226458106.001.0001

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21. https://doi.org/10.2307/2682899

Searls, D. B. (2010). The roots of bioinformatics. PLoS Computational Biology, 6(6), e1000809. https://doi.org/10.1371/journal.pcbi.1000809

Buckheit, J. B., & Donoho, D. L. (1995). WaveLab and reproducible research. In Wavelets and statistics (pp. 55–81). Springer. https://doi.org/10.1007/978-1-4612-2544-7_5

Goecks, J., Nekrutenko, A., Taylor, J., & Team, T. G. (2010). Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology, 11(8), R86. https://doi.org/10.1186/gb-2010-11-8-r86

Ioannidis, J. P. A., Allison, D. B., Ball, C. A., Coulibaly, I., Cui, X., Culhane, A., Falchi, M., Furlanello, C., Game, L., Jurman, G., Mangion, J., Mehta, T., Nitzberg, M., Page, G. P., Petretto, E., & van Noort, V. (2008). Repeatability of published microarray gene expression analyses. Nature Genetics, 41(2), 149–155. https://doi.org/10.1038/ng.295

Ziemann, M., Eren, Y., & El-Osta, A. (2016). Gene name errors are widespread in the scientific literature. Genome Biology, 17(1). https://doi.org/10.1186/s13059-016-1044-7

Baxter, S. M., Day, S. W., Fetrow, J. S., & Reisinger, S. J. (2006). Scientific software development is not an oxymoron. PLoS Computational Biology, 2(9), e87. https://doi.org/10.1371/journal.pcbi.0020087

Ouzounis, C. A., & Valencia, A. (2003). Early bioinformatics: The birth of a discipline–a personal view. Bioinformatics, 19(17), 2176–2190. https://doi.org/10.1093/bioinformatics/btg309

Partridge, D., Lopez, P. D., & Johnston, V. S. (1984). Computer programs as theories in biology. Journal of Theoretical Biology, 108(4), 539–564. https://doi.org/10.1016/s0022-5193(84)80079-x

Agapow, P.-M., & Burt, A. (2001). Indices of multilocus linkage disequilibrium. Molecular Ecology Notes, 1(1-2), 101–102. https://doi.org/10.1046/j.1471-8278.2000.00014.x

Ali, S., Soubeyrand, S., Gladieux, P., Giraud, T., Leconte, M., Gautier, A., Mboup, M., Chen, W., Vallavieille-Pope, C., & Enjalbert, J. (2016). Cloncase: Estimation of sex frequency and effective population size by clonemate resampling in partially clonal organisms. Molecular Ecology Resources. https://doi.org/10.1111/1755-0998.12511

Arnaud-Haond, S., & Belkhir, K. (2006). Genclone: A computer program to analyse genotypic data, test for clonality and describe spatial clonal organization. Molecular Ecology Notes, 7(1), 15–17. https://doi.org/10.1111/j.1471-8286.2006.01522.x

Felsenstein, J. (1989). PHYLIP-phylogeny inference package (version 3.2). Cladistics, 5, 163–166.

Bailleul, D., Stoeckel, S., & Arnaud-Haond, S. (2016). RClone: A package to identify MultiLocus clonal lineages and handle clonal data sets in R. Methods in Ecology and Evolution, 7(8), 966–970. https://doi.org/10.1111/2041-210x.12550

Goudet, J. (1995). FSTAT (version 1.2): A computer program to calculate F-statistics. Journal of Heredity, 86(6), 485–486. Retrieved from http://jhered.oxfordjournals.org/content/86/6/485

Kamvar, Z. N., Tabima, J. F., & Grünwald, N. J. (2014b). Poppr: An R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ, 2, e281. https://doi.org/10.7717/peerj.281

Winter, D. J. (2012). mmod: An R library for the calculation of population differentiation statistics. Molecular Ecology Resources, 12(6), 1158–1160. https://doi.org/10.1111/j.1755-0998.2012.03174.x

McKiernan, E. C., Bourne, P. E., Brown, C. T., Buck, S., Kenall, A., Lin, J., McDougall, D., Nosek, B. A., Ram, K., Soderberg, C. K., Spies, J. R., Thaney, K., Updegrove, A., Woo, K. H., & Yarkoni, T. (2016). How open science helps researchers succeed. eLife, 5. https://doi.org/10.7554/elife.16800

Wilson, G., Aruliah, D., Brown, C. T., Hong, N. P. C., Davis, M., Guy, R. T., Haddock, S. H., Huff, K. D., Mitchell, I. M., Plumbley, M. D., & others. (2014). Best practices for scientific computing. PLoS Biology, 12(1), e1001745.

Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2016). Good enough practices in scientific computing (Version 2). Retrieved from http://arxiv.org/abs/1609.00037v2

Jombart, T., & Ahmed, I. (2011). Adegenet 1.3-1: New tools for the analysis of genome-wide SNP data. Bioinformatics, 27(21), 3070–3071.

Kamvar, Z. N., Brooks, J. C., & Grünwald, N. J. (2015a). Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. Frontiers in Genetics, 6. https://doi.org/10.3389/fgene.2015.00208

Paradis, E., Gosselin, T., Grünwald, N. J., Jombart, T., Manel, S., & Lapp, H. (2016). Towards an integrated ecosystem of R packages for the analysis of population genetic data. Molecular Ecology Resources. https://doi.org/10.1111/1755-0998.12636

Kamvar, Z. N., López-Uribe, M. M., Coughlan, S., Grünwald, N. J., Lapp, H., & Manel, S. (2016). Developing educational resources for population genetics in R: An open and collaborative approach. Molecular Ecology Resources. https://doi.org/10.1111/1755-0998.12558

Kamvar, Z. N., Larsen, M. M., Kanaskie, A. M., Hansen, E. M., & Grünwald, N. J. (2014a, December). Sudden_Oak_Death_in_Oregon_Forests: Spatial and temporal population dynamics of the sudden oak death epidemic in Oregon Forests. ZENODO. https://doi.org/10.5281/zenodo.13007

Kamvar, Z. N., Larsen, M. M., Kanaskie, A. M., Hansen, E. M., & Grünwald, N. J. (2015c). Spatial and temporal analysis of populations of the sudden oak death pathogen in Oregon forests. Phytopathology, 105(7), 982–989. https://doi.org/10.1094/phyto-12-14-0350-fi

Prlić, A., & Procter, J. B. (2012). Ten simple rules for the open development of scientific software. PLoS Computational Biology, 8(12), e1002802. https://doi.org/10.1371/journal.pcbi.1002802