In recent years, advances in genomic sequencing through next
generation sequencing have enabled the development of millions of new
markers, which have been consistently used in studies of important
agronomic traits (Edwards and Batley, 2010). Hence,
breeders have focused on studies that allow the association of markers
with the phenotypic of interest, make predictions of performance or in
research involving population studies and diversity analysis.
After generating a large amount of genomic data, the genomic data
must pass through a quality control and imputation of the missing
genomic data. Some GS models that take advantage of dimensionality
reduction like GBLUP and RKHS needs to construct relationship matrices.
Moreover, the understanding of population genetics parameters is
important as well.
Therefore, there is the need of prepare genomic datasets, in such
a way that it can be easily applied in a great range of studies. Hence,
we propose proposed snpReady package based on needs of setting datasets
ready to run genomic studies in leading genomic applications. Thus, we
include in this package the three primary critical needs faced before
running genomic analyses: preparation and quality control
of datasets, estimation of relationship matrices
and estimation of basic population genetics
parameters.
The function was created with the purpose of recoding and
reshape the matrices obtained from different SNP genotyping platforms
and let it ready to be used in genomic analyses. Thus, it reshapes,
recodes, makes quality control and imputation of missing data in the
dataset. It also cleans the map based on the same threshold used in the
raw data set.
Marker matrix used as input can be organized in two ways. In the
long format, for each subject, there is an observation of each
SNP and its alleles. Thus, if there are n individuals and p markers, the matrix is the order
of is (n × p) × 4
where columns represent samples and SNP identification and one for each
allele, in this particular order. In order to illustrate the process, we
use a maize data set with 64 lines and 539 SNPs.
## Loading required package: Matrix
## Loading required package: matrixcalc
## Loading required package: stringr
## Loading required package: rgl
## Loading required package: impute
geno <- read.table("http://italo-granato.github.io/geno.txt", header = TRUE, na.strings = "NA")
head(geno)
## sample marker allele.1 allele.2
## 1 A01 PHM4468.13 G G
## 2 A01 PHM2770.19 G G
## 3 A01 PZA00485.2 A A
## 4 A01 PZA00522.7 A A
## 5 A01 PZA00627.1 G G
## 6 A01 PZA00473.5 G G
## [1] 34496 4
Another format that can be used as input is wide, where samples are in the row and SNPs in columns. In this case, the matrix is the order of n × p.
## PHM10225.15 PHM10321.11 PHM10404.8 PHM10525.11 PHM10525.9
## A01 "CC" "GG" NA "CC" "TT"
## A02 "CG" "CC" "GG" "TT" "GG"
## A03 "CC" "GG" "GG" "CC" "TT"
## A04 "CC" "GG" "GG" "TT" "GG"
## A05 "CC" "GG" "CC" NA "GG"
## A06 "CC" "GG" "GG" "TT" "GG"
## A07 "CC" "CC" "CC" "CC" "GG"
## A08 "CC" "GG" "GG" "CC" "TT"
## A09 "CC" "CC" "CC" "TT" "GG"
## A10 "CC" "GG" "GG" "TT" "GG"
The raw data can be coded as the nitrogenous base (A, C, G, and
T) or the standard A and B. However, if the data was already recoded
this can be set by base
argument and only quality control
is made. Thus, if the base is FALSE
dataset must be coded
as 0, 1 and 2. Missing data should be set as NA.
Quality control (QC) for genomic data is based on removing
individuals and markers with poor information. In general, for
individuals, it can be associated with the amount of missing markers.
Hence, samples which do not meet some threshold of missing data can be
removed through sweep.sample
. For markers, QC is based on
allele frequency and the amount of missing data. Markers with a low
frequency of one of its alleles usually are non-informative and in some
situations are considered monomorphic. Therefore, they can be removed.
The same trend is applied for missing data. Thus, the QC process is made
through MAF
and call.rate
.
Along with removing non-informative markers, imputation of missing
data becomes necessary. We present two types of map-independent
imputation. One is based on the Wright equilibrium. For a missing
position, we assume that the probability of taking up value is dependent
on both allele frequency of the SNP and the level of homozygosity of an
individual. Thus: $$P(x_{ij})=\left\{
\begin{array}{ll}
P(x = 0) = (1 - p_j)^2+ p_j (1 - p_j ) F_i\\
P(x = 1) = 2p_j (1 - p_j )-2p_j (1 - p_j)\\
P(x=2)= p_j^2+p_j (1 - p_j ) F_i
\end{array}
\right.$$ Where pi is the
frequency of the major allele for an SNP i, and Fj is the level
of homozygosity of an individual j estimate as a proportion of the
amount of homozygous loci relative to the total of loci. Another method
of imputation currently implemented is based on the mean of each SNP.
Each missing position of a SNP j is replaced by its mean. Thus:
p̄ = 2pj
In this section we are going to present a basic usage for
raw.data
in a maize data set. It is composed of 64 inbred
lines genotyped with 539 SNPs. First, let’s run a basic quality control
on this data set.
geno.ready <- raw.data(data = as.matrix(geno), frame = "long", base = TRUE, sweep.sample = 0.5, call.rate = 0.95, maf = 0.10, imput = FALSE)
M <- geno.ready$M.clean
M[1:10,1:5]
## PHM10225.15 PHM10321.11 PHM10525.9 PHM11000.21 PHM11114.7
## A01 2 0 0 0 0
## A02 1 2 2 0 0
## A03 2 0 0 0 0
## A04 2 0 2 0 0
## A05 2 0 2 2 2
## A06 2 0 2 0 0
## A07 2 2 2 0 0
## A08 2 0 0 0 0
## A09 2 2 2 0 2
## A10 2 0 2 0 0
Above, we recode and clean the dataset according to some quality control parameters. Using a call rate of 0.95 means that it can accept markers with only a maximum of 5% of missing data. MAF of 0.1 means that markers with a minor frequency allele less than 0.1 were removed. For individuals,the threshold used for sweep.sample was 0.5, which means that samples with more than 50% of missing data were removed.
Besides the cleaned matrix, a report is also outputted as a summary on how many and what markers were removed by the steps in the quality control.
## $maf
## $maf$r
## [1] "67 Markers removed by MAF = 0.1"
##
## $maf$whichID
## [1] "PHM10750.26" "PHM12693.8" "PHM12904.7" "PHM1506.23" "PHM15871.11"
## [6] "PHM16605.19" "PHM175.25" "PHM18705.23" "PHM2159.8" "PHM2177.85"
## [11] "PHM2770.19" "PHM2773.30" "PHM3612.19" "PHM3637.15" "PHM3688.14"
## [16] "PHM3690.23" "PHM3691.15" "PHM3691.18" "PHM3931.17" "PHM424.13"
## [21] "PHM424.16" "PHM4313.17" "PHM4339.79" "PHM4349.6" "PHM4469.13"
## [26] "PHM4552.6" "PHM4662.153" "PHM4757.14" "PHM4951.8" "PHM5529.4"
## [31] "PHM5535.8" "PHM5727.5" "PHM574.14" "PHM5740.9" "PHM7616.35"
## [36] "PHM8074.6" "PHM835.25" "PHM9672.9" "PZA00103.20" "PZA00192.6"
## [41] "PZA00213.19" "PZA00216.2" "PZA00276.18" "PZA00344.10" "PZA00381.3"
## [46] "PZA00425.9" "PZA00525.17" "PZA00573.3" "PZA00615.3" "PZA00658.19"
## [51] "PZA00730.2" "PZA00804.1" "PZA00878.2" "PZA00881.1" "PZA01790.1"
## [56] "PZA02151.3" "PZA02167.2" "PZA02820.17" "PZA02850.18" "PZA02921.9"
## [61] "PZA02923.7" "PZA02949.22" "PZA02952.10" "PZA03011.6" "PZA03035.5"
## [66] "PZA03063.18" "PZA03083.7"
##
##
## $cr
## $cr$r
## [1] "151 Markers removed by Call Rate = 0.95"
##
## $cr$whichID
## [1] "PHM10404.8" "PHM10525.11" "PHM112.8" "PHM1155.14" "PHM11985.27"
## [6] "PHM12992.5" "PHM1307.11" "PHM13094.8" "PHM13639.13" "PHM13648.11"
## [11] "PHM13675.18" "PHM13823.7" "PHM1438.34" "PHM14618.11" "PHM14671.9"
## [16] "PHM1506.23" "PHM15331.16" "PHM1534.45" "PHM15449.10" "PHM15501.9"
## [21] "PHM15961.13" "PHM1684.20" "PHM1745.16" "PHM175.25" "PHM17698.8"
## [26] "PHM18513.156" "PHM1870.20" "PHM18887.12" "PHM1899.157" "PHM1932.51"
## [31] "PHM2006.57" "PHM2177.85" "PHM229.15" "PHM2343.25" "PHM2350.14"
## [36] "PHM2487.6" "PHM2749.10" "PHM3094.23" "PHM3147.18" "PHM3155.14"
## [41] "PHM3171.5" "PHM3309.8" "PHM3463.18" "PHM3512.186" "PHM3627.11"
## [46] "PHM3631.47" "PHM3668.12" "PHM3691.18" "PHM3844.14" "PHM3852.15"
## [51] "PHM3963.33" "PHM4196.27" "PHM4348.16" "PHM4662.153" "PHM4818.15"
## [56] "PHM4880.179" "PHM4905.6" "PHM5296.6" "PHM5480.17" "PHM5572.19"
## [61] "PHM5622.21" "PHM563.9" "PHM5637.15" "PHM5798.39" "PHM5805.19"
## [66] "PHM5822.15" "PHM597.18" "PHM662.27" "PHM7417.21" "PHM7898.10"
## [71] "PHM8074.6" "PHM9162.135" "PHM9241.13" "PHM9635.30" "PHM9676.10"
## [76] "PZA00004.2" "PZA00049.12" "PZA00058.5" "PZA00061.1" "PZA00084.2"
## [81] "PZA00099.6" "PZA00153.3" "PZA00182.4" "PZA00192.6" "PZA00216.2"
## [86] "PZA00219.7" "PZA00220.11" "PZA00289.11" "PZA00311.4" "PZA00323.3"
## [91] "PZA00334.2" "PZA00345.15" "PZA00395.2" "PZA00399.10" "PZA00405.7"
## [96] "PZA00423.16" "PZA00425.9" "PZA00439.6" "PZA00447.6" "PZA00453.2"
## [101] "PZA00463.3" "PZA00492.26" "PZA00516.3" "PZA00524.2" "PZA00525.17"
## [106] "PZA00562.4" "PZA00588.2" "PZA00636.6" "PZA00653.5" "PZA00672.6"
## [111] "PZA00684.12" "PZA00714.1" "PZA00725.4" "PZA00730.2" "PZA00881.1"
## [116] "PZA00934.2" "PZA00941.2" "PZA01073.1" "PZA01280.2" "PZA01327.1"
## [121] "PZA01342.2" "PZA01359.1" "PZA01451.1" "PZA01462.1" "PZA01623.3"
## [126] "PZA01713.4" "PZA01715.1" "PZA01765.1" "PZA01810.2" "PZA01887.1"
## [131] "PZA02049.1" "PZA02247.1" "PZA02249.4" "PZA02462.1" "PZA02478.7"
## [136] "PZA02688.2" "PZA02727.1" "PZA02731.1" "PZA02746.2" "PZA02820.17"
## [141] "PZA02872.1" "PZA02890.4" "PZA02949.22" "PZA02957.5" "PZA02958.17"
## [146] "PZA03027.23" "PZA03028.5" "PZA03036.6" "PZA03076.10" "PZA03090.31"
## [151] "PZA03659.1"
##
##
## $sweep
## $sweep$r
## [1] "2 Samples removed by sweep.sample = 0.5"
##
## $sweep$whichID
## [1] "E08" "E09"
##
##
## $imput
## [1] "No data point was imputed"
For instance, 67 and 151 markers were removed by MAF and call rate, respectively. Moreover, two individuals were removed. Furthermore, the report shows the SNP and sample IDs removed by each procedure. It is important highlight that some SNPs may fail in both controls applied. Thus, their identification will appear in both sections. Here, 204 markers were removed, which 14 did not attend either QC applied.
Now, the quality control can be combined with imputation. When using the imputation, it is needed to choose between the supported methods.
geno.ready2 <- raw.data(data = as.matrix(geno), frame = "long", base = TRUE, sweep.sample = 0.5, call.rate = 0.95, maf = 0.10, imput = TRUE, imput.type = "wright", outfile = "012")
Mwrth <- geno.ready2$M.clean
Mwrth[1:10,1:5]
## PHM10225.15 PHM10321.11 PHM10525.9 PHM11000.21 PHM11114.7
## A01 2 0 0 0 0
## A02 1 2 2 0 0
## A03 2 0 0 0 0
## A04 2 0 2 0 0
## A05 2 0 2 2 2
## A06 2 0 2 0 0
## A07 2 2 2 0 0
## A08 2 0 0 0 0
## A09 2 2 2 0 2
## A10 2 0 2 0 0
Regarding the Wright’s method, its accuracy is about 62%. Imputation based on mean also have intermediate accuracy. These methods are less accurate than those map-dependent methods, such as the one developed in BEAGLE (Browning and Browning, 2016). However, as described by Rutkoski et al. (2013) with lower missing data rates allowed a mere imputation is enough.
Concerning the output formats, the outfile
argument
is used to export the cleaned matrix in the suitable format. Default is
the count of reference allele such that AA = 2, Aa = 1, aa = 0. Another format is
coded as -1, 0, 1, case considering pj as 0.5.
Finally, a special format is suitable for use in the
STRUCTURE
software (Pritchard et al.,
2000). To generate this output, is necessary the raw data be coded
as nitrogenous bases.
geno.readySTR <- raw.data(data = as.matrix(geno), frame = "long", base = TRUE, sweep.sample = 0.5, call.rate = 0.95, maf = 0.10, imput = FALSE, outfile = "structure")
Mstr <- geno.readySTR$M.clean
Mstr[1:10,1:5]
## PHM10225.15 PHM10321.11 PHM10525.9 PHM11000.21 PHM11114.7
## A01 2 3 4 3 4
## A01 2 3 4 3 4
## A02 2 2 3 3 4
## A02 3 2 3 3 4
## A03 2 3 4 3 4
## A03 2 3 4 3 4
## A04 2 3 3 3 4
## A04 2 3 3 3 4
## A05 2 3 3 1 2
## A05 2 3 3 1 2
In this output, each sample is split into two rows, one for each
allele. Nitrogenous bases are then recoded to a specific number such
that A is 1, C is 2, G is 3 and T is 4 and missing data are assigned as
-9. Also, given that STRUCTURE
can handle missing data,
arguments related to imputation are ignored when this output is
selected.
Genomic relationship matrices (GRM) are created through the
G.matrix
function. Genomic prediction models use these
matrices, especially in the G-BLUP. Different kinship parametrizations
were proposed with the aim of increasing the accuracy of prediction of
genomic selection. G.matrix
function estimates four types
of additive and one dominant genomic relationship matrix.
The matrix used in the data entry should be coded as allele count,
where AA = 2, Aa = 1, aa = 0. However, it accepts
continuous values for some SNPs. The second argument is the method to be
used to generate the GRM. There are four methods to generate it, three
forms of GRM and the Gaussian kernel (GK). Methods currently implemented
are the one proposed by VanRaden (2008), two methods
proposed by Yang et al. (2010), the UAR (Unified
Additive Relationship) and adjusted UAR, and the Gaussian kernel (Pérez-Elizalde et al., 2015).
## A01 A02 A03 A04 A05
## A01 0.94726809 -0.16011571 -0.04074426 -0.06898057 -0.10208522
## A02 -0.16011571 0.95804332 0.01897392 -0.06962968 -0.01822012
## A03 -0.04074426 0.01897392 0.92312117 -0.03879693 -0.01153428
## A04 -0.06898057 -0.06962968 -0.03879693 0.92299135 0.08498848
## A05 -0.10208522 -0.01822012 -0.01153428 0.08498848 1.04593291
## A01 A02 A03 A04 A05
## A01 0.8992808 0.4032560 0.4343691 0.4178845 0.5012764
## A02 0.4032560 0.9419096 0.3841906 0.4090457 0.4816250
## A03 0.4343691 0.3841906 0.8476102 0.4048261 0.4843004
## A04 0.4178845 0.4090457 0.4048261 0.8454841 0.5108710
## A05 0.5012764 0.4816250 0.4843004 0.5108710 1.0435704
Just for the vanRaden method, as shown above, two matrices additive and dominance are generated. Otherwise, only the additive matrix is outputted.
## A01 A02 A03 A04 A05
## A01 1.85023583 -0.256096924 -0.072284174 -0.1401453 -0.157504431
## A02 -0.25609692 1.827051568 -0.002434675 -0.1079725 -0.062167032
## A03 -0.07228417 -0.002434675 1.838119031 -0.0706931 -0.005500161
## A04 -0.14014529 -0.107972530 -0.070693097 1.8399097 0.165851895
## A05 -0.15750443 -0.062167032 -0.005500161 0.1658519 1.938086241
Two forms of output are generated, one as a matrix of order n × n.
## [1] 62 62
Another output is the long format, such that the inverse is used to
create a table in a suitable format for ASREML-R
, where
three columns are representing the row, columns and respective value of
the lower diagonal matrix.
## Warning in posdefmat(Ga): The matrix was adjusted for the nearest positive
## definite matrix
## row column value
## 1 1 1 202585.6
## 63 2 1 213470.1
## 64 2 2 224940.7
## 125 3 1 214852.5
## 126 3 2 226397.8
## 127 3 3 227986.0
These two forms are suitable to use in many aplications like BGLR (Pérez and De Los Campos, 2014), rrBLUP (Endelman, 2011) and ASREML-R (Butler et al., 2009).
The popgen
function aims to produce a summary about
population genetics parameters. Thus, for any marker locus j with alleles A1 and A2, the function
estimates:
$\begin{aligned}
f(A_1) = p_j = \frac{nA_1}{2N} \\
\\
f(A_2) = q_j = 1 - p_j
\end{aligned}$
maf = min(pj, qj)
$\begin{aligned}
H_o=\frac{nH_j}{N}
\end{aligned}$
He = 2pjqj
DG = 1 − pj2 − qj2
PIC = 1 − (pj2 + qj2) − (2pj2qj2)
$\begin{aligned}
H_o=\frac{nNA_j}{N}
\end{aligned}$
$\begin{aligned}
\chi^2=\frac{1}{d}\sum_{k=1}^{3} \frac{(O_k - E_k)^2}{E_k}
\end{aligned}$
where nA1 is the number of copies of A1 allele in the population, N is the number of individuals, nHj is the number of heterozygous genotypes (of type A1A2 or A2A1) in the locus j, Ok is the observed values for the genotypes 0, 1 and 2, Ek is the expected values for 0 = N * (1 − pj)2, 1 = N * 2pj(1 − pj) and 2 = N * pj2.
Moreover, for any individual i, the function provides estimates of:
$\begin{aligned} H_{o_i} = \frac{nH_i}{m} \end{aligned}$
$\begin{aligned} F_i=\frac{O(H_i)-E(H)}{m-E(H)} \end{aligned}$
where nHi is the number of heterozigous genotypes (of type A1A2 or A2A1) in the individual i, m is the number of markers, O(Hi) is the observed homozygosity for individual i, E(H) = ∑j1 − 2pj(1 − pj) is the expected homozygosity across all snps.
Then, based on the estimates described above, populational parameters, means and bounders of DG, PIC, MAF, Ho, F and S are provided. Besides, measures of variability are estimated:
$
$\begin{aligned} Va = \sum_{j=1}^{m} 2p_jq_j \end{aligned}$
$\begin{aligned} Vd = \sum_{j=1}^{m} (2 p_{j} q_{j})^2 \end{aligned}$
where $\bar{F_i} = \frac{\sum_{i=1}^{N} F_i}{N}$ is the mean of Fi
In general, data set needs to be coded as allele count and
missing data can be accepted. Estimates for the whole population is
presented in a list. Thus, there are estimates for
Genotype
, Markers
, Population
and
Variability
.
## p q MAF He Ho GD PIC Miss chiSq pval
## PHM10225.15 0.90 0.10 0.10 0.19 0.02 0.19 0.17 0 51.802 6.138145e-13
## PHM10321.11 0.34 0.66 0.34 0.45 0.03 0.45 0.35 0 53.392 2.731601e-13
## PHM10525.9 0.74 0.26 0.26 0.38 0.00 0.38 0.31 0 62.000 3.434573e-15
## PHM11000.21 0.15 0.85 0.15 0.25 0.03 0.25 0.22 0 46.930 7.356560e-12
## PHM11114.7 0.44 0.56 0.44 0.49 0.03 0.49 0.37 0 54.131 1.875179e-13
## PHM11226.13 0.43 0.57 0.43 0.49 0.02 0.49 0.37 0 57.981 2.646319e-14
## Ho Fi
## A01 0.02 0.938
## A02 0.06 0.838
## A03 0.01 0.962
## A04 0.01 0.962
## A05 0.03 0.923
## A06 0.03 0.915
## mean lower upper
## GD 0.39 0.17 0.50
## PIC 0.31 0.16 0.38
## MAF 0.30 0.10 0.50
## Ho 0.04 0.00 0.14
## F 0.91 0.63 1.00
## estimate
## Ne 34.10
## Va 130.15
## Vd 53.84
## number.of.genotypes 62.00
## number.of.markers 335.00
The popgen
function also allows the assignment of
subpopulations to individuals. It estimates the same population genetic
parameters for each subpopulation, such as effective population size,
components of additive and dominance variance, and endogamy. In our
example, let’s split the whole population into two subpopulations
according to the nitrogen use efficiency (NE), the high and low one.
subgroups <- as.matrix(c(rep("HNE", 10), rep("LNE", 52)))
pop.gen <- popgen(M = Mwrth, subgroups = subgroups)
It produces for each group the same estimates described above. For example, in the HNE group:
## p q MAF He Ho GD PIC Miss chiSq pval
## PHM10225.15 0.95 0.05 0.05 0.10 0.1 0.10 0.09 0 0.028 0.867814108
## PHM10321.11 0.30 0.70 0.30 0.42 0.0 0.42 0.33 0 10.000 0.001565402
## PHM10525.9 0.70 0.30 0.30 0.42 0.0 0.42 0.33 0 10.000 0.001565402
## PHM11000.21 0.10 0.90 0.10 0.18 0.0 0.18 0.16 0 10.000 0.001565402
## PHM11114.7 0.20 0.80 0.20 0.32 0.0 0.32 0.27 0 10.000 0.001565402
## PHM11226.13 0.55 0.45 0.45 0.50 0.1 0.50 0.37 0 6.368 0.011621498
## Ho Fi
## A01 0.02 0.931
## A02 0.06 0.819
## A03 0.01 0.957
## A04 0.01 0.957
## A05 0.03 0.914
## A06 0.03 0.905
and for LNE group:
## p q MAF He Ho GD PIC Miss chiSq pval
## PHM10225.15 0.88 0.12 0.12 0.20 0.00 0.20 0.18 0 52.000 5.550063e-13
## PHM10321.11 0.35 0.65 0.35 0.45 0.04 0.45 0.35 0 43.539 4.156341e-11
## PHM10525.9 0.75 0.25 0.25 0.38 0.00 0.38 0.30 0 52.000 5.550063e-13
## PHM11000.21 0.15 0.85 0.15 0.26 0.04 0.26 0.23 0 37.771 7.954843e-10
## PHM11114.7 0.48 0.52 0.48 0.50 0.04 0.50 0.37 0 44.297 2.821839e-11
## PHM11226.13 0.40 0.60 0.40 0.48 0.00 0.48 0.37 0 52.000 5.550063e-13
## Ho Fi
## A11 0.01 0.962
## A12 0.02 0.954
## B01 0.02 0.954
## B02 0.07 0.808
## B03 0.09 0.762
## B04 0.04 0.892
Moreover, the separation by groups allows the identification of exclusive/absent and fixed alleles in the assigned sub-populations. Which are in the HNE:
## [1] "There are no exclusive alleles for this group"
## [1] "PHM15278.6" "PHM1576.25" "PHM1962.33" "PHM2518.28" "PHM2672.19"
## [6] "PHM3402.11"
and in the LNE:
## [1] "PHM15278.6" "PHM1576.25" "PHM1962.33" "PHM2518.28" "PHM2672.19"
## [6] "PHM3402.11"
## [1] "There are no fixed alleles for this group"
As can be noted, in the HNE group there are many fixed alleles and no particular alleles. On the other hand, in LNE, the opposite was observed. Hence, for two populations, considering that quality control was made and monomorphic markers were removed, if one allele is fixed in one population, the other will be exclusive in another subpopulation.
In the presence of subgroups, popgen
also produces the
Wright’s F statistics in order to measure genetic divergence between
populations. FIS
measures the deficiency of average heterozygotes in each population,
FST is
related to the degree of gene differentiation among populations and
FIT is
the deficiency of average heterozygotes in subpopulations. The
parameters are estimated as follows:
$\begin{aligned} F_{IS} = 1 - \dfrac{H_I}{H_S} \end{aligned}$
$\begin{aligned} F_{ST} = 1 - \dfrac{H_S}{H_T} \end{aligned}$
$\begin{aligned} F_{IT} = 1 - \dfrac{H_I}{H_T} \end{aligned}$
where HI is the weighted average of observed heterozygosity in the subpopulations, HS is the average expected heterozygosity estimated from each subpopulation and HT is the expected heterozygosity in the total population. Besides the estimates of F parameters for each marker locus, estimates considering means of HI, HT and HS for each population is produced. Thus, There is a comparison considering all population and in a pairwise format which each pair of populations are compared against each other.