Introduction

     In recent years, advances in genomic sequencing through next generation sequencing have enabled the development of millions of new markers, which have been consistently used in studies of important agronomic traits (Edwards and Batley, 2010). Hence, breeders have focused on studies that allow the association of markers with the phenotypic of interest, make predictions of performance or in research involving population studies and diversity analysis.
     After generating a large amount of genomic data, the genomic data must pass through a quality control and imputation of the missing genomic data. Some GS models that take advantage of dimensionality reduction like GBLUP and RKHS needs to construct relationship matrices. Moreover, the understanding of population genetics parameters is important as well.
      Therefore, there is the need of prepare genomic datasets, in such a way that it can be easily applied in a great range of studies. Hence, we propose proposed snpReady package based on needs of setting datasets ready to run genomic studies in leading genomic applications. Thus, we include in this package the three primary critical needs faced before running genomic analyses: preparation and quality control of datasets, estimation of relationship matrices and estimation of basic population genetics parameters.

Quality control

The function was created with the purpose of recoding and reshape the matrices obtained from different SNP genotyping platforms and let it ready to be used in genomic analyses. Thus, it reshapes, recodes, makes quality control and imputation of missing data in the dataset. It also cleans the map based on the same threshold used in the raw data set.
Marker matrix used as input can be organized in two ways. In the long format, for each subject, there is an observation of each SNP and its alleles. Thus, if there are n individuals and p markers, the matrix is the order of is (n × p) × 4 where columns represent samples and SNP identification and one for each allele, in this particular order. In order to illustrate the process, we use a maize data set with 64 lines and 539 SNPs.

library(snpReady)

## Loading required package: Matrix

## Loading required package: matrixcalc

## Loading required package: stringr

## Loading required package: rgl

## Loading required package: impute

geno <- read.table("http://italo-granato.github.io/geno.txt", header = TRUE, na.strings = "NA")
head(geno)

##   sample     marker allele.1 allele.2
## 1    A01 PHM4468.13        G        G
## 2    A01 PHM2770.19        G        G
## 3    A01 PZA00485.2        A        A
## 4    A01 PZA00522.7        A        A
## 5    A01 PZA00627.1        G        G
## 6    A01 PZA00473.5        G        G

dim(geno)

## [1] 34496     4

Another format that can be used as input is wide, where samples are in the row and SNPs in columns. In this case, the matrix is the order of n × p.

x[1:10,1:5]

##     PHM10225.15 PHM10321.11 PHM10404.8 PHM10525.11 PHM10525.9
## A01 "CC"        "GG"        NA         "CC"        "TT"      
## A02 "CG"        "CC"        "GG"       "TT"        "GG"      
## A03 "CC"        "GG"        "GG"       "CC"        "TT"      
## A04 "CC"        "GG"        "GG"       "TT"        "GG"      
## A05 "CC"        "GG"        "CC"       NA          "GG"      
## A06 "CC"        "GG"        "GG"       "TT"        "GG"      
## A07 "CC"        "CC"        "CC"       "CC"        "GG"      
## A08 "CC"        "GG"        "GG"       "CC"        "TT"      
## A09 "CC"        "CC"        "CC"       "TT"        "GG"      
## A10 "CC"        "GG"        "GG"       "TT"        "GG"

     The raw data can be coded as the nitrogenous base (A, C, G, and T) or the standard A and B. However, if the data was already recoded this can be set by base argument and only quality control is made. Thus, if the base is FALSE dataset must be coded as 0, 1 and 2. Missing data should be set as NA.
     Quality control (QC) for genomic data is based on removing individuals and markers with poor information. In general, for individuals, it can be associated with the amount of missing markers. Hence, samples which do not meet some threshold of missing data can be removed through sweep.sample. For markers, QC is based on allele frequency and the amount of missing data. Markers with a low frequency of one of its alleles usually are non-informative and in some situations are considered monomorphic. Therefore, they can be removed. The same trend is applied for missing data. Thus, the QC process is made through MAF and call.rate.
     Along with removing non-informative markers, imputation of missing data becomes necessary. We present two types of map-independent imputation. One is based on the Wright equilibrium. For a missing position, we assume that the probability of taking up value is dependent on both allele frequency of the SNP and the level of homozygosity of an individual. Thus: $$P(x_{ij})=\left\{ \begin{array}{ll} P(x = 0) = (1 - p_j)^2+ p_j (1 - p_j ) F_i\\ P(x = 1) = 2p_j (1 - p_j )-2p_j (1 - p_j)\\ P(x=2)= p_j^2+p_j (1 - p_j ) F_i \end{array} \right.$$ Where p_i is the frequency of the major allele for an SNP i, and F_j is the level of homozygosity of an individual j estimate as a proportion of the amount of homozygous loci relative to the total of loci. Another method of imputation currently implemented is based on the mean of each SNP. Each missing position of a SNP j is replaced by its mean. Thus: p̄ = 2p_j      In this section we are going to present a basic usage for raw.data in a maize data set. It is composed of 64 inbred lines genotyped with 539 SNPs. First, let’s run a basic quality control on this data set.

geno.ready <- raw.data(data = as.matrix(geno), frame = "long", base = TRUE, sweep.sample = 0.5, call.rate = 0.95, maf = 0.10, imput = FALSE)
M <- geno.ready$M.clean
M[1:10,1:5]

##     PHM10225.15 PHM10321.11 PHM10525.9 PHM11000.21 PHM11114.7
## A01           2           0          0           0          0
## A02           1           2          2           0          0
## A03           2           0          0           0          0
## A04           2           0          2           0          0
## A05           2           0          2           2          2
## A06           2           0          2           0          0
## A07           2           2          2           0          0
## A08           2           0          0           0          0
## A09           2           2          2           0          2
## A10           2           0          2           0          0

Above, we recode and clean the dataset according to some quality control parameters. Using a call rate of 0.95 means that it can accept markers with only a maximum of 5% of missing data. MAF of 0.1 means that markers with a minor frequency allele less than 0.1 were removed. For individuals,the threshold used for sweep.sample was 0.5, which means that samples with more than 50% of missing data were removed.

Besides the cleaned matrix, a report is also outputted as a summary on how many and what markers were removed by the steps in the quality control.

geno.ready$report

## $maf
## $maf$r
## [1] "67 Markers removed by MAF = 0.1"
## 
## $maf$whichID
##  [1] "PHM10750.26" "PHM12693.8"  "PHM12904.7"  "PHM1506.23"  "PHM15871.11"
##  [6] "PHM16605.19" "PHM175.25"   "PHM18705.23" "PHM2159.8"   "PHM2177.85" 
## [11] "PHM2770.19"  "PHM2773.30"  "PHM3612.19"  "PHM3637.15"  "PHM3688.14" 
## [16] "PHM3690.23"  "PHM3691.15"  "PHM3691.18"  "PHM3931.17"  "PHM424.13"  
## [21] "PHM424.16"   "PHM4313.17"  "PHM4339.79"  "PHM4349.6"   "PHM4469.13" 
## [26] "PHM4552.6"   "PHM4662.153" "PHM4757.14"  "PHM4951.8"   "PHM5529.4"  
## [31] "PHM5535.8"   "PHM5727.5"   "PHM574.14"   "PHM5740.9"   "PHM7616.35" 
## [36] "PHM8074.6"   "PHM835.25"   "PHM9672.9"   "PZA00103.20" "PZA00192.6" 
## [41] "PZA00213.19" "PZA00216.2"  "PZA00276.18" "PZA00344.10" "PZA00381.3" 
## [46] "PZA00425.9"  "PZA00525.17" "PZA00573.3"  "PZA00615.3"  "PZA00658.19"
## [51] "PZA00730.2"  "PZA00804.1"  "PZA00878.2"  "PZA00881.1"  "PZA01790.1" 
## [56] "PZA02151.3"  "PZA02167.2"  "PZA02820.17" "PZA02850.18" "PZA02921.9" 
## [61] "PZA02923.7"  "PZA02949.22" "PZA02952.10" "PZA03011.6"  "PZA03035.5" 
## [66] "PZA03063.18" "PZA03083.7" 
## 
## 
## $cr
## $cr$r
## [1] "151 Markers removed by Call Rate = 0.95"
## 
## $cr$whichID
##   [1] "PHM10404.8"   "PHM10525.11"  "PHM112.8"     "PHM1155.14"   "PHM11985.27" 
##   [6] "PHM12992.5"   "PHM1307.11"   "PHM13094.8"   "PHM13639.13"  "PHM13648.11" 
##  [11] "PHM13675.18"  "PHM13823.7"   "PHM1438.34"   "PHM14618.11"  "PHM14671.9"  
##  [16] "PHM1506.23"   "PHM15331.16"  "PHM1534.45"   "PHM15449.10"  "PHM15501.9"  
##  [21] "PHM15961.13"  "PHM1684.20"   "PHM1745.16"   "PHM175.25"    "PHM17698.8"  
##  [26] "PHM18513.156" "PHM1870.20"   "PHM18887.12"  "PHM1899.157"  "PHM1932.51"  
##  [31] "PHM2006.57"   "PHM2177.85"   "PHM229.15"    "PHM2343.25"   "PHM2350.14"  
##  [36] "PHM2487.6"    "PHM2749.10"   "PHM3094.23"   "PHM3147.18"   "PHM3155.14"  
##  [41] "PHM3171.5"    "PHM3309.8"    "PHM3463.18"   "PHM3512.186"  "PHM3627.11"  
##  [46] "PHM3631.47"   "PHM3668.12"   "PHM3691.18"   "PHM3844.14"   "PHM3852.15"  
##  [51] "PHM3963.33"   "PHM4196.27"   "PHM4348.16"   "PHM4662.153"  "PHM4818.15"  
##  [56] "PHM4880.179"  "PHM4905.6"    "PHM5296.6"    "PHM5480.17"   "PHM5572.19"  
##  [61] "PHM5622.21"   "PHM563.9"     "PHM5637.15"   "PHM5798.39"   "PHM5805.19"  
##  [66] "PHM5822.15"   "PHM597.18"    "PHM662.27"    "PHM7417.21"   "PHM7898.10"  
##  [71] "PHM8074.6"    "PHM9162.135"  "PHM9241.13"   "PHM9635.30"   "PHM9676.10"  
##  [76] "PZA00004.2"   "PZA00049.12"  "PZA00058.5"   "PZA00061.1"   "PZA00084.2"  
##  [81] "PZA00099.6"   "PZA00153.3"   "PZA00182.4"   "PZA00192.6"   "PZA00216.2"  
##  [86] "PZA00219.7"   "PZA00220.11"  "PZA00289.11"  "PZA00311.4"   "PZA00323.3"  
##  [91] "PZA00334.2"   "PZA00345.15"  "PZA00395.2"   "PZA00399.10"  "PZA00405.7"  
##  [96] "PZA00423.16"  "PZA00425.9"   "PZA00439.6"   "PZA00447.6"   "PZA00453.2"  
## [101] "PZA00463.3"   "PZA00492.26"  "PZA00516.3"   "PZA00524.2"   "PZA00525.17" 
## [106] "PZA00562.4"   "PZA00588.2"   "PZA00636.6"   "PZA00653.5"   "PZA00672.6"  
## [111] "PZA00684.12"  "PZA00714.1"   "PZA00725.4"   "PZA00730.2"   "PZA00881.1"  
## [116] "PZA00934.2"   "PZA00941.2"   "PZA01073.1"   "PZA01280.2"   "PZA01327.1"  
## [121] "PZA01342.2"   "PZA01359.1"   "PZA01451.1"   "PZA01462.1"   "PZA01623.3"  
## [126] "PZA01713.4"   "PZA01715.1"   "PZA01765.1"   "PZA01810.2"   "PZA01887.1"  
## [131] "PZA02049.1"   "PZA02247.1"   "PZA02249.4"   "PZA02462.1"   "PZA02478.7"  
## [136] "PZA02688.2"   "PZA02727.1"   "PZA02731.1"   "PZA02746.2"   "PZA02820.17" 
## [141] "PZA02872.1"   "PZA02890.4"   "PZA02949.22"  "PZA02957.5"   "PZA02958.17" 
## [146] "PZA03027.23"  "PZA03028.5"   "PZA03036.6"   "PZA03076.10"  "PZA03090.31" 
## [151] "PZA03659.1"  
## 
## 
## $sweep
## $sweep$r
## [1] "2 Samples removed by sweep.sample = 0.5"
## 
## $sweep$whichID
## [1] "E08" "E09"
## 
## 
## $imput
## [1] "No data point was imputed"

For instance, 67 and 151 markers were removed by MAF and call rate, respectively. Moreover, two individuals were removed. Furthermore, the report shows the SNP and sample IDs removed by each procedure. It is important highlight that some SNPs may fail in both controls applied. Thus, their identification will appear in both sections. Here, 204 markers were removed, which 14 did not attend either QC applied.

Now, the quality control can be combined with imputation. When using the imputation, it is needed to choose between the supported methods.

geno.ready2 <- raw.data(data = as.matrix(geno), frame = "long", base = TRUE, sweep.sample = 0.5, call.rate = 0.95, maf = 0.10, imput = TRUE, imput.type = "wright", outfile = "012")
Mwrth <- geno.ready2$M.clean
Mwrth[1:10,1:5]

##     PHM10225.15 PHM10321.11 PHM10525.9 PHM11000.21 PHM11114.7
## A01           2           0          0           0          0
## A02           1           2          2           0          0
## A03           2           0          0           0          0
## A04           2           0          2           0          0
## A05           2           0          2           2          2
## A06           2           0          2           0          0
## A07           2           2          2           0          0
## A08           2           0          0           0          0
## A09           2           2          2           0          2
## A10           2           0          2           0          0

Regarding the Wright’s method, its accuracy is about 62%. Imputation based on mean also have intermediate accuracy. These methods are less accurate than those map-dependent methods, such as the one developed in BEAGLE (Browning and Browning, 2016). However, as described by Rutkoski et al. (2013) with lower missing data rates allowed a mere imputation is enough.

Concerning the output formats, the outfile argument is used to export the cleaned matrix in the suitable format. Default is the count of reference allele such that AA = 2, Aa = 1, aa = 0. Another format is coded as -1, 0, 1, case considering p_j as 0.5. Finally, a special format is suitable for use in the STRUCTURE software (Pritchard et al., 2000). To generate this output, is necessary the raw data be coded as nitrogenous bases.

geno.readySTR <- raw.data(data = as.matrix(geno), frame = "long", base = TRUE, sweep.sample = 0.5, call.rate = 0.95, maf = 0.10, imput = FALSE, outfile = "structure")
Mstr <- geno.readySTR$M.clean
Mstr[1:10,1:5]

##     PHM10225.15 PHM10321.11 PHM10525.9 PHM11000.21 PHM11114.7
## A01           2           3          4           3          4
## A01           2           3          4           3          4
## A02           2           2          3           3          4
## A02           3           2          3           3          4
## A03           2           3          4           3          4
## A03           2           3          4           3          4
## A04           2           3          3           3          4
## A04           2           3          3           3          4
## A05           2           3          3           1          2
## A05           2           3          3           1          2

In this output, each sample is split into two rows, one for each allele. Nitrogenous bases are then recoded to a specific number such that A is 1, C is 2, G is 3 and T is 4 and missing data are assigned as -9. Also, given that STRUCTURE can handle missing data, arguments related to imputation are ignored when this output is selected.

Genomic relationship matrix

Genomic relationship matrices (GRM) are created through the G.matrix function. Genomic prediction models use these matrices, especially in the G-BLUP. Different kinship parametrizations were proposed with the aim of increasing the accuracy of prediction of genomic selection. G.matrix function estimates four types of additive and one dominant genomic relationship matrix.
The matrix used in the data entry should be coded as allele count, where AA = 2, Aa = 1, aa = 0. However, it accepts continuous values for some SNPs. The second argument is the method to be used to generate the GRM. There are four methods to generate it, three forms of GRM and the Gaussian kernel (GK). Methods currently implemented are the one proposed by VanRaden (2008), two methods proposed by Yang et al. (2010), the UAR (Unified Additive Relationship) and adjusted UAR, and the Gaussian kernel (Pérez-Elizalde et al., 2015).

G <- G.matrix(M = Mwrth, method = "VanRaden", format = "wide") 
Ga <- G$Ga
Ga[1:5,1:5]

##             A01         A02         A03         A04         A05
## A01  0.94726809 -0.16011571 -0.04074426 -0.06898057 -0.10208522
## A02 -0.16011571  0.95804332  0.01897392 -0.06962968 -0.01822012
## A03 -0.04074426  0.01897392  0.92312117 -0.03879693 -0.01153428
## A04 -0.06898057 -0.06962968 -0.03879693  0.92299135  0.08498848
## A05 -0.10208522 -0.01822012 -0.01153428  0.08498848  1.04593291

Gd <- G$Gd
Gd[1:5,1:5]

##           A01       A02       A03       A04       A05
## A01 0.8992808 0.4032560 0.4343691 0.4178845 0.5012764
## A02 0.4032560 0.9419096 0.3841906 0.4090457 0.4816250
## A03 0.4343691 0.3841906 0.8476102 0.4048261 0.4843004
## A04 0.4178845 0.4090457 0.4048261 0.8454841 0.5108710
## A05 0.5012764 0.4816250 0.4843004 0.5108710 1.0435704

Just for the vanRaden method, as shown above, two matrices additive and dominance are generated. Otherwise, only the additive matrix is outputted.

G <- G.matrix(M = Mwrth, method = "UAR", format = "wide") 
G[1:5,1:5]

##             A01          A02          A03        A04          A05
## A01  1.85023583 -0.256096924 -0.072284174 -0.1401453 -0.157504431
## A02 -0.25609692  1.827051568 -0.002434675 -0.1079725 -0.062167032
## A03 -0.07228417 -0.002434675  1.838119031 -0.0706931 -0.005500161
## A04 -0.14014529 -0.107972530 -0.070693097  1.8399097  0.165851895
## A05 -0.15750443 -0.062167032 -0.005500161  0.1658519  1.938086241

Two forms of output are generated, one as a matrix of order n × n.

dim(G)

## [1] 62 62

Another output is the long format, such that the inverse is used to create a table in a suitable format for ASREML-R, where three columns are representing the row, columns and respective value of the lower diagonal matrix.

G <- G.matrix(M = Mwrth, method = "UAR", format = "long")

## Warning in posdefmat(Ga): The matrix was adjusted for the nearest positive
## definite matrix

head(G)

##     row column    value
## 1     1      1 202585.6
## 63    2      1 213470.1
## 64    2      2 224940.7
## 125   3      1 214852.5
## 126   3      2 226397.8
## 127   3      3 227986.0

These two forms are suitable to use in many aplications like BGLR (Pérez and De Los Campos, 2014), rrBLUP (Endelman, 2011) and ASREML-R (Butler et al., 2009).

Population genetics summary

The popgen function aims to produce a summary about population genetics parameters. Thus, for any marker locus j with alleles A₁ and A₂, the function estimates:

Allele frequencies

$\begin{aligned} f(A_1) = p_j = \frac{nA_1}{2N} \\ \\ f(A_2) = q_j = 1 - p_j \end{aligned}$

Minor allele frequency

maf = min(p_j, q_j)

Observed heterozygosity

$\begin{aligned} H_o=\frac{nH_j}{N} \end{aligned}$

Expected heterozygosity

H_e = 2p_jq_j

Nei’s genetic diversity

DG = 1 − p_j² − q_j²

Polymorphic information content

PIC = 1 − (p_j² + q_j²) − (2p_j²q_j²)

Missing rate

$\begin{aligned} H_o=\frac{nNA_j}{N} \end{aligned}$

Hardy-Weinberg equilibrium statistic

$\begin{aligned} \chi^2=\frac{1}{d}\sum_{k=1}^{3} \frac{(O_k - E_k)^2}{E_k} \end{aligned}$

where nA₁ is the number of copies of A₁ allele in the population, N is the number of individuals, nH_j is the number of heterozygous genotypes (of type A₁A₂ or A₂A₁) in the locus j, O_k is the observed values for the genotypes 0, 1 and 2, E_k is the expected values for 0 = N * (1 − p_j)², 1 = N * 2p_j(1 − p_j) and 2 = N * p_j².

Moreover, for any individual i, the function provides estimates of:

Observed heterozygosity

$\begin{aligned} H_{o_i} = \frac{nH_i}{m} \end{aligned}$

Inbreeding coefficient

$\begin{aligned} F_i=\frac{O(H_i)-E(H)}{m-E(H)} \end{aligned}$

where nH_i is the number of heterozigous genotypes (of type A₁A₂ or A₂A₁) in the individual i, m is the number of markers, O(H_i) is the observed homozygosity for individual i, E(H) = ∑_j1 − 2p_j(1 − p_j) is the expected homozygosity across all snps.

Then, based on the estimates described above, populational parameters, means and bounders of DG, PIC, MAF, Ho, F and S are provided. Besides, measures of variability are estimated:

Effective populational size

Additive variance due to the allele frequencies

$\begin{aligned} Va = \sum_{j=1}^{m} 2p_jq_j \end{aligned}$

Dominance variance due to the allele frequencies

$\begin{aligned} Vd = \sum_{j=1}^{m} (2 p_{j} q_{j})^2 \end{aligned}$

where $\bar{F_i} = \frac{\sum_{i=1}^{N} F_i}{N}$ is the mean of F_i

In general, data set needs to be coded as allele count and missing data can be accepted. Estimates for the whole population is presented in a list. Thus, there are estimates for Genotype, Markers, Population and Variability.

pop.gen <- popgen(M = Mwrth)
head(pop.gen$whole$Markers)

##                p    q  MAF   He   Ho   GD  PIC Miss  chiSq         pval
## PHM10225.15 0.90 0.10 0.10 0.19 0.02 0.19 0.17    0 51.802 6.138145e-13
## PHM10321.11 0.34 0.66 0.34 0.45 0.03 0.45 0.35    0 53.392 2.731601e-13
## PHM10525.9  0.74 0.26 0.26 0.38 0.00 0.38 0.31    0 62.000 3.434573e-15
## PHM11000.21 0.15 0.85 0.15 0.25 0.03 0.25 0.22    0 46.930 7.356560e-12
## PHM11114.7  0.44 0.56 0.44 0.49 0.03 0.49 0.37    0 54.131 1.875179e-13
## PHM11226.13 0.43 0.57 0.43 0.49 0.02 0.49 0.37    0 57.981 2.646319e-14

head(pop.gen$whole$Genotypes)

##       Ho    Fi
## A01 0.02 0.938
## A02 0.06 0.838
## A03 0.01 0.962
## A04 0.01 0.962
## A05 0.03 0.923
## A06 0.03 0.915

head(pop.gen$whole$Population)

##     mean lower upper
## GD  0.39  0.17  0.50
## PIC 0.31  0.16  0.38
## MAF 0.30  0.10  0.50
## Ho  0.04  0.00  0.14
## F   0.91  0.63  1.00

head(pop.gen$whole$Variability)

##                     estimate
## Ne                     34.10
## Va                    130.15
## Vd                     53.84
## number.of.genotypes    62.00
## number.of.markers     335.00

The popgen function also allows the assignment of subpopulations to individuals. It estimates the same population genetic parameters for each subpopulation, such as effective population size, components of additive and dominance variance, and endogamy. In our example, let’s split the whole population into two subpopulations according to the nitrogen use efficiency (NE), the high and low one.

subgroups <- as.matrix(c(rep("HNE", 10), rep("LNE", 52)))
pop.gen <- popgen(M = Mwrth, subgroups = subgroups)

It produces for each group the same estimates described above. For example, in the HNE group:

head(pop.gen$bygroup$HNE$Markers)

##                p    q  MAF   He  Ho   GD  PIC Miss  chiSq        pval
## PHM10225.15 0.95 0.05 0.05 0.10 0.1 0.10 0.09    0  0.028 0.867814108
## PHM10321.11 0.30 0.70 0.30 0.42 0.0 0.42 0.33    0 10.000 0.001565402
## PHM10525.9  0.70 0.30 0.30 0.42 0.0 0.42 0.33    0 10.000 0.001565402
## PHM11000.21 0.10 0.90 0.10 0.18 0.0 0.18 0.16    0 10.000 0.001565402
## PHM11114.7  0.20 0.80 0.20 0.32 0.0 0.32 0.27    0 10.000 0.001565402
## PHM11226.13 0.55 0.45 0.45 0.50 0.1 0.50 0.37    0  6.368 0.011621498

head(pop.gen$bygroup$HNE$Genotypes)

##       Ho    Fi
## A01 0.02 0.931
## A02 0.06 0.819
## A03 0.01 0.957
## A04 0.01 0.957
## A05 0.03 0.914
## A06 0.03 0.905

and for LNE group:

head(pop.gen$bygroup$LNE$Markers)

##                p    q  MAF   He   Ho   GD  PIC Miss  chiSq         pval
## PHM10225.15 0.88 0.12 0.12 0.20 0.00 0.20 0.18    0 52.000 5.550063e-13
## PHM10321.11 0.35 0.65 0.35 0.45 0.04 0.45 0.35    0 43.539 4.156341e-11
## PHM10525.9  0.75 0.25 0.25 0.38 0.00 0.38 0.30    0 52.000 5.550063e-13
## PHM11000.21 0.15 0.85 0.15 0.26 0.04 0.26 0.23    0 37.771 7.954843e-10
## PHM11114.7  0.48 0.52 0.48 0.50 0.04 0.50 0.37    0 44.297 2.821839e-11
## PHM11226.13 0.40 0.60 0.40 0.48 0.00 0.48 0.37    0 52.000 5.550063e-13

head(pop.gen$bygroup$LNE$Genotypes)

##       Ho    Fi
## A11 0.01 0.962
## A12 0.02 0.954
## B01 0.02 0.954
## B02 0.07 0.808
## B03 0.09 0.762
## B04 0.04 0.892

Moreover, the separation by groups allows the identification of exclusive/absent and fixed alleles in the assigned sub-populations. Which are in the HNE:

head(pop.gen$bygroup$HNE$exclusive)

## [1] "There are no exclusive alleles for this group"

head(pop.gen$bygroup$HNE$fixed)

## [1] "PHM15278.6" "PHM1576.25" "PHM1962.33" "PHM2518.28" "PHM2672.19"
## [6] "PHM3402.11"

and in the LNE:

head(pop.gen$bygroup$LNE$exclusive)

## [1] "PHM15278.6" "PHM1576.25" "PHM1962.33" "PHM2518.28" "PHM2672.19"
## [6] "PHM3402.11"

head(pop.gen$bygroup$LNE$fixed)

## [1] "There are no fixed alleles for this group"

As can be noted, in the HNE group there are many fixed alleles and no particular alleles. On the other hand, in LNE, the opposite was observed. Hence, for two populations, considering that quality control was made and monomorphic markers were removed, if one allele is fixed in one population, the other will be exclusive in another subpopulation.

In the presence of subgroups, popgen also produces the Wright’s F statistics in order to measure genetic divergence between populations. F_IS measures the deficiency of average heterozygotes in each population, F_ST is related to the degree of gene differentiation among populations and F_IT is the deficiency of average heterozygotes in subpopulations. The parameters are estimated as follows:

$\begin{aligned} F_{IS} = 1 - \dfrac{H_I}{H_S} \end{aligned}$

$\begin{aligned} F_{ST} = 1 - \dfrac{H_S}{H_T} \end{aligned}$

$\begin{aligned} F_{IT} = 1 - \dfrac{H_I}{H_T} \end{aligned}$

where H_I is the weighted average of observed heterozygosity in the subpopulations, H_S is the average expected heterozygosity estimated from each subpopulation and H_T is the expected heterozygosity in the total population. Besides the estimates of F parameters for each marker locus, estimates considering means of H_I, H_T and H_S for each population is produced. Thus, There is a comparison considering all population and in a pairwise format which each pair of populations are compared against each other.

References

[1] Edwards, D. and J. Batley. 2010. Plant genome sequencing: applications for crop improvement. Plant Biotechnology Journal 8: 2-9.Available Here

[2] Browning, B.L., and S.R. Browning. 2016. Genotype Imputation with Millions of Reference Samples. American Journal of Human Genetics 98(1): 116–126. Available Here

[3] Butler, D.G., B.R. Cullis, A.R. Gilmour, and B.J. Gogel. 2009. ASReml-R reference manual. Available Here

[4] Endelman, J.B. 2011. Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. The Plant Genome 4: 250–255. Available Here

[5] Pérez, P., and G. De Los Campos. 2014. Genome-wide regression & prediction with the BGLR statistical package. Genetics 198(2): 483–495.

[6] Pérez-Elizalde, S., J. Cuevas, P. Pérez-Rodríguez, and J. Crossa. 2015. Selection of the Bandwidth Parameter in a Bayesian Kernel Regression Model for Genomic-Enabled Prediction. Journal of Agricultural, Biological, and Environmental Statistics 20(4): 512–532. Available Here

[7] Pritchard, J.K., M. Stephens, and P. Donnelly. 2000. Inference of Population Structure Using Multilocus Genotype Data. Genetics 155(2): 945-959.

[8] Rutkoski, J.E., J. Poland, J.-L. Jannink, and M.E. Sorrells. 2013. Imputation of Unordered Markers and the Impact on Genomic Selection Accuracy. G3: Genes, Genomes, Genetics 3(3): 427–439. Available Here

[9] VanRaden, P.M. 2008. Efficient methods to compute genomic predictions. Journal of dairy science 91(11): 4414–23. Available Here

[10] Yang, J., B. Benyamin, B.P. Mcevoy, S. Gordon, A.K. Henders, R. Dale, P.A. Madden, A.C. Heath, N.G. Martin, G.W. Montgomery, M.E. Goddard, and P.M. Visscher. 2010. Common SNPs explain a large proportion of heritability for human height. Nature 569: 565–569. Available Here

snpReady: a helper tool to run genomic analysis