Because multiple results can share the same minimum number of samples, one way of assessing their importance is to calculate how evenly the alleles are distributed among the samples. This can be done with entropy statistics.

rpv_stats(tab, f = NULL)

Arguments

tab

a numeric matrix

f

a factor that is the same length as the number of columns in tab. This is used to split the matrix into groups for analysis.

Value

a data frame with five columns: eH, G, E5, lambda, and missing

Details

This function calculates four diversity statistics from your data using variable counts, along with a measure of missing data.

  • eH: The exponentiation of Shannon's entropy: exp(sum(-x * log(x))) (Shannon, 1948)

  • G : Stoddart and Taylor's index, or inverse Simpson's index: 1/sum(x^2) (Stoddart and Taylor, 1988; Simpson, 1949)

  • E5: Evenness (5), the ratio between the above two estimates: (G - 1)/(eH - 1) (Pielou, 1975)

  • lambda: Unbiased Simpson's index: (n/(n-1))*(1 - sum(x^2))

  • missing: the proportion of cells with missing data out of the total number of cells.
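
As a rough illustration of the formulas above, the following base R sketch computes the four diversity statistics from a single vector of counts. It assumes that x is the vector of relative frequencies (counts divided by their total) and that n is the total count; this page does not confirm either assumption. `diversity_sketch` is a hypothetical helper and is not part of the package.

```r
# Hypothetical helper: it only mirrors the formulas listed above and is
# not the package's implementation.
diversity_sketch <- function(counts) {
  counts <- counts[counts > 0]   # drop zero counts so that log(x) is finite
  n <- sum(counts)               # assumption: n is the total count
  x <- counts / n                # assumption: x is the relative frequency
  eH     <- exp(sum(-x * log(x)))
  G      <- 1 / sum(x^2)
  E5     <- (G - 1) / (eH - 1)   # NaN when only one variable is observed
  lambda <- (n / (n - 1)) * (1 - sum(x^2))
  c(eH = eH, G = G, E5 = E5, lambda = lambda)
}

diversity_sketch(c(10, 5, 3, 2))
```

With perfectly even counts, such as `diversity_sketch(c(5, 5, 5, 5))`, eH and G both equal the number of variables and E5 equals 1.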

Both G and eH can be thought of as the number of equally abundant variables needed to achieve the same observed diversity. G and eH give different weight to variables based on their abundance, so we use evenness to describe how uniform this distribution is.
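
The "equally abundant variables" interpretation can be checked with a small base R sketch. It assumes x is the relative frequency of each variable, and `effective_numbers` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: effective number of variables under both indices,
# assuming x is the relative frequency of each variable.
effective_numbers <- function(counts) {
  x <- counts / sum(counts)
  x <- x[x > 0]
  c(eH = exp(-sum(x * log(x))), G = 1 / sum(x^2))
}

even   <- effective_numbers(rep(10, 5))         # five equally abundant variables
skewed <- effective_numbers(c(46, 1, 1, 1, 1))  # five variables, one dominant

even    # both values equal 5, the number of variables
skewed  # both values fall well below 5, and G falls further than eH
```

G drops further than eH in the skewed case because squaring the frequencies gives more weight to the dominant variable.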

Note that this version of evenness differs from Shannon's evenness, which is H/ln(S), where S is the number of variables (in our case).
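
To make the distinction concrete, this short base R sketch computes both evenness measures for the same toy counts. The counts are invented for illustration, and x is assumed to be the relative frequency of each variable.

```r
counts <- c(46, 1, 1, 1, 1)      # toy counts, invented for illustration
x <- counts / sum(counts)        # assumption: x is the relative frequency
S <- length(counts)              # number of variables

H  <- -sum(x * log(x))           # Shannon's entropy
G  <- 1 / sum(x^2)               # inverse Simpson's index
E5 <- (G - 1) / (exp(H) - 1)     # evenness as defined on this page
J  <- H / log(S)                 # Shannon's evenness

c(E5 = E5, shannon_evenness = J)
```

The two values differ for uneven counts like these; both equal 1 when every variable is equally abundant.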

If a factor is supplied, the columns of the matrix are first split by this factor and each statistic is calculated for each level.
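
The splitting step can be sketched in base R as follows. The toy matrix, its "locus.allele" column names, and the choice to compute each statistic from within-group column sums are all assumptions made for illustration; the package's internals may differ.

```r
# Toy count matrix: 4 samples (rows) by 5 variables (columns).
# Column names follow a hypothetical "locus.allele" pattern.
tab <- matrix(
  c(1, 0, 1, 0, 0,
    0, 1, 0, 1, 0,
    1, 0, 0, 0, 1,
    1, 0, 1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("A.1", "A.2", "B.1", "B.2", "B.3"))
)

# One factor level per locus, with the same length as ncol(tab).
f <- factor(sub("[.][0-9]+$", "", colnames(tab)))

# Assumption: a statistic is computed from the column sums within each group.
G_by_group <- sapply(split(seq_len(ncol(tab)), f), function(cols) {
  counts <- colSums(tab[, cols, drop = FALSE])
  x <- counts / sum(counts)
  1 / sum(x^2)   # inverse Simpson's index for this group
})

G_by_group
```

Using `drop = FALSE` keeps single-column groups as matrices, so `colSums()` still works when a level contains only one variable.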

Note

The calculations within this function are derived from the vegan and poppr R packages.

References

Shannon, C. E. A mathematical theory of communication. Bell System Technical Journal, 27: 379-423, 623-656, 1948.

Simpson, E. H. Measurement of diversity. Nature, 163: 688, 1949. doi:10.1038/163688a0

Stoddart, J. A. and Taylor, J. F. Genotypic diversity: estimation and prediction in samples. Genetics, 118(4): 705-711, 1988.

Pielou, E. C. Ecological Diversity. Wiley, 1975.

Examples

# Calculate statistics for the whole data set -----------------------------
data(monilinia)
rpv_stats(monilinia)
#>         eH        G        E5    lambda    missing
#> 1 43.99737 33.36337 0.7526825 0.9703129 0.01000797
# Use a grouping factor for variables -------------------------------------
# Each variable in this data set represents an allele at one of
# thirteen loci. If we wanted a table across all loci individually, we can
# group by locus name.
f <- gsub("[.][0-9]+", "", colnames(monilinia))
f <- factor(f, levels = unique(f))
colMeans(emon <- rpv_stats(monilinia, f = f)) # average entropy across loci
#>         eH          G         E5     lambda    missing
#> 3.62446099 2.89447982 0.72184259 0.61097340 0.01107226
emon
#>               eH        G        E5    lambda     missing
#> CHMFc4  2.575682 2.400827 0.8890289 0.5856954 0.000000000
#> CHMFc5  1.690501 1.340743 0.4934713 0.2551374 0.026515152
#> CHMFc12 2.061671 1.990989 0.9334232 0.4997043 0.037878788
#> SEA     4.300812 2.865960 0.5653034 0.6535809 0.011363636
#> SED     4.981124 3.792067 0.7013263 0.7391126 0.007575758
#> SEE     1.832885 1.471073 0.5655922 0.3214559 0.011363636
#> SEG     3.079306 2.626818 0.7823848 0.6216841 0.007575758
#> SEI     3.927056 3.203771 0.7528968 0.6905033 0.007575758
#> SEL     3.489223 3.263458 0.9093032 0.6962238 0.003787879
#> SEN     4.674992 3.984450 0.8120970 0.7518723 0.000000000
#> SEP     3.734248 2.657942 0.6063612 0.6261589 0.007575758
#> SEQ     5.973290 4.456824 0.6950779 0.7785967 0.007575758
#> SER     4.797202 3.573316 0.6776875 0.7229284 0.015151515
# calculating entropy for minimum sets ------------------------------------
set.seed(1999)
i <- rpv_find(monilinia, n = 150, cut = TRUE, progress = FALSE)
colMeans(emon1 <- rpv_stats(monilinia[i[[1]], ], f = f))
#>          eH           G          E5      lambda     missing
#> 4.588487224 3.544323824 0.707476453 0.692547682 0.007692308
colMeans(emon2 <- rpv_stats(monilinia[i[[2]], ], f = f))
#>          eH           G          E5      lambda     missing
#> 4.527585941 3.438677804 0.699826401 0.687444739 0.007692308