
Analysis of population structure and genetic diversity in a Southern African soybean collection based on single nucleotide polymorphism markers


Soybean is an emerging strategic crop for nutrition, food security, and livestock feed in Africa, but improvement of its productivity is hampered by low genetic diversity. There is a need to broaden the tropical germplasm base through incorporation and introgression of temperate germplasm in Southern African breeding programs. Therefore, this study was conducted to determine the population structure and molecular diversity among 180 temperate and 30 tropical soybean accessions using single nucleotide polymorphism (SNP) markers. The results revealed very low levels of molecular diversity among the 210 lines, with implications for the breeding strategy. A low fixation index (FST) value of 0.06 was observed, indicating low genetic differentiation among populations. This suggests high genetic exchange among different lines due to global germplasm sharing. Inference based on three tools, namely the Evanno method, silhouette plots and a UPGMA phylogenetic tree, showed the existence of three sub-populations. The UPGMA tree showed that the first cluster is composed of three genotypes, the second cluster has two genotypes, while the rest of the genotypes constituted the third cluster. The third cluster revealed low variation among most genotypes. Negligible differences were observed among some of the lines, such as Tachiyukata and Yougestu, indicating shared parental backgrounds. However, large phenotypic differences were observed among the accessions, suggesting that there is potential for their utilization in the breeding programs. Rapid phenotyping revealed grain yield potential ranging from one to five tons per hectare for the 200 non-genetically modified accessions. Findings from this study will inform the crossing strategy for the subtropical soybean breeding programs.
Innovation strategies for improving genetic variability in the germplasm collection, such as investments in pre-breeding, increasing the geographic sources of introductions and exploitation of mutation breeding would be recommended to enhance genetic gain.


Soybean is an important nutritious crop used worldwide for food, feed and industrial oils. Its high utility is explained by its high protein content of about 40% and its high oil content, which reaches or exceeds 20% in some genotypes (Bellaloui et al. 2010; Orf 2010). In 2019, worldwide production was over 300 million metric tons from 120 million hectares of land (FAOSTAT 2021), which translates to a global average yield of 2.5 tons per hectare. Production is dominated by a few countries. The world’s leading soybean producers are Brazil, the United States of America, Argentina and China. Africa contributes only 0.9% of total world production (FAOSTAT 2021), which is negligible and does not match the regional demand for soybean products. The major African producers are South Africa, Nigeria, Ghana, Uganda, Ethiopia, Zambia, Malawi and Zimbabwe. All these countries fail to meet their national demand. As a result, Africa imports soybean.

There is a need to develop varieties that are highly productive and adapted to the tropical and subtropical ecologies in Africa. Efforts are underway to identify such varieties through the regional soybean breeding network that employs the Pan African Trials (PAT) under the leadership of the Soybean Innovation Lab (SIL), in collaboration with the International Institute of Tropical Agriculture (IITA), national public programs and private seed companies. The PAT shows a generally low level of productivity due to limited genetic improvement. However, genetic improvement efforts are challenged by the narrow genetic base of soybean (Cornelious and Sneller 2002; Lee et al. 2014; Li et al. 2013) owing to several domestication bottlenecks (Gwinner et al. 2017; Hyten et al. 2007; Rafalski 2002).

The baseline genetic diversity of the soybean germplasm pool and introductions should be established in order to devise a viable breeding strategy. Genetic improvement of any crop rests upon the diversity present within and among the breeding populations (Biyeu et al. 2010). Knowledge of genetic variability helps in selection of parental lines to be used when making crosses, establishment of core collections and enhanced utilization of the germplasm in breeding programs (Abebe et al. 2021; Bandillo et al. 2017). While there is limited diversity among cultivars within country or regional breeding programs because of sharing of common parents (Gwinner et al. 2017; Hahn and Würschum 2014; Tiwari et al. 2019), introduction of exotic germplasm plays a crucial role in widening the genetic base from which parents can be selected for use to make bi-parental crosses.

The tropical and subtropical soybean breeding programs in Africa utilize temperate germplasm to improve local varieties. The major sources of soybean germplasm lines and populations for Africa have been China, Japan, Korea and the USA (Grieshop and Fahey 2001; Jeong et al. 2019a, b), where greater genetic diversity has been reported (Oliveira et al. 2010). China and the USA maintain large collections in their gene banks. There are about 30 000 accessions in the Chinese gene bank, while the USDA gene banks contain about 15 000 accessions (Liu et al. 2017). This germplasm cannot be used directly in the breeding programs in Africa; it needs to be characterized before crossing is done. For example, 93% of the Chinese germplasm accessions are primitive cultivars, although they are highly diverse (Chen and Nelson 2005). These collections are important sources of favorable alleles which can enhance breeding in Africa. However, when such introductions are to be used for breeding purposes, they need to be screened for their usefulness (Jeong et al. 2019a, b; Li et al. 2014) so that the results can inform the breeding strategy.

A survey of the literature indicates that germplasm diversity characterization can be conducted following two approaches. Both morphological or phenotypic and molecular genetic diversity studies have been used to assess variation in soybean (Abebe et al. 2021; Bandillo et al. 2015; Chander et al. 2021; Malik et al. 2011; Jeong et al. 2019a, b; Ma et al. 2006; Nawaz et al. 2021; Ojo et al. 2012; Valliyodan et al. 2021; Wang et al. 2012; Mihaljević et al. 2020). The advantages and limitations of both approaches have been discussed.

While morphological or phenotypic methods have been successful in discriminating soybean genotypes, their efficiency is compromised by genotype by environment interaction (GxE) effects. GxE masks genotypic differences among the germplasm entries. High levels of GxE require that genotypes are evaluated at many sites. However, because multi-location trials are very costly, only a few sites are often used, which gives low resolution due to the small number of data points. Results also take a long time to obtain. The length of the cycle from seed to seed is a hindrance, as it makes phenotyping time consuming, labor intensive and costly (Chander et al. 2021; Nadeem et al. 2018). As a result, the use of molecular markers has increased. Molecular markers are not affected by GxE interactions, are not specific to a growth stage and are abundant within the genome (Nadeem et al. 2018). Although molecular markers were initially expensive, improvements such as the development of single nucleotide polymorphism (SNP) markers and their amenability to automation have brought the cost per data point to a level that is very competitive with phenotypic data. Currently, SNPs are among the most widely used markers (Zhu et al. 2003; Edwards et al. 2007; Nadeem et al. 2018).

The SNPs are the markers of choice for molecular diversity studies. SNP markers have been successfully used for diversity studies for several crops including soybean (Abebe et al. 2021; Chander et al. 2021; Liu et al. 2017), cowpea (Fatokun et al. 2018; Qin et al. 2016; Sodedji et al. 2021), pigeon pea (Yang et al. 2006; Zavinon et al. 2020) and common bean (Blair et al. 2013; Cortés et al. 2011; Nemlı et al. 2017). Assessment of the genetic diversity among elite lines and varieties developed by IITA using SNPs revealed high diversity within the germplasm and grouped the germplasm into three clusters based on genetic relatedness (Abebe et al. 2021). Similarly, broad genetic base among tropical soybean lines with a genetic diversity index of 0.414 using SNP markers has been reported (Chander et al. 2021). However, previous studies cited low genetic diversity among the germplasm from Brazil, China, Europe and North America. Low genetic diversity was reported among Brazilian (Gwinner et al. 2017), USA and Chinese germplasm (Liu et al. 2017). Central European lines were reported to be closely related to the Swiss and Canadian lines, but distantly related to the Chinese (Hahn and Würschum 2014). These findings suggest the need for breeders to know the molecular diversity in the germplasm to guide breeding strategies.

Improvement of soybean varieties for adaptation and productivity ranks high on the product profile for the Southern Africa region. Early maturity is one of the important traits for soybean lines to be deployed in sub-Saharan Africa, because climate change has shortened growing seasons (Ziervogel et al. 2014). This requires sourcing of exotic germplasm with favorable alleles for early maturity. Temperate germplasm is less sensitive to latitude, which is a major determinant of flowering and maturity time in soybean. The soybean breeding programs in Africa have collected both temperate and tropical germplasm for utilization in breeding. However, the level of molecular diversity in this collection has not been established. The present study was therefore conducted to assess the population structure and genetic diversity of the temperate and tropical soybean accessions using SNP markers.

Materials and methods

Plant material and sampling

Public (belonging to government/ national research institutions) and private (from private institutions) germplasm collection which comprised 210 lines from South Africa (10), Malawi (1), Zimbabwe (19), and USA (180) was used for the study. All the genotypes were planted in plastic sleeves in a screen house in 2019. The 10 genotypes from South Africa were planted in South Africa while the other 200 were planted in Zimbabwe. An average of six leaf discs was sampled from a single plant from each of the genotypes at 3 weeks after emergence using the LGC genomics plant sample collection kit. The leaf discs were placed in 96 well plates and sealed with perforated strip caps. A desiccant sachet was placed on top of the sealed tubes and a rack lid was fixed on top. The samples were placed in a sealable bag and shipped to LGC genomics, Germany, for genotyping using the targeted genotyping-by-sequencing (SeqSNP) method.

Rapid phenotypic screening

A total of 200 non-genetically modified accessions (temperate and tropical) were planted in Zimbabwe. The ten accessions from South Africa could not be evaluated in Zimbabwe because they are genetically modified (they contain the Roundup-ready herbicide resistance trait). The rapid screening was conducted at the Rattray Arnold Research Station (RARS) (17°38′60″ S, 31°14′24″ E), near Harare. Rapid phenotypic screening for yield was done in an unreplicated observation trial in two-row plots that were 1.5 m long, with a spacing of 0.45 m between rows and 0.05 m within rows. Grain yield was recorded from the whole plot at maturity.
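As an illustration of how a whole-plot grain weight could be scaled to the kg/ha figures reported later, the sketch below uses the plot geometry given above. It assumes that the plot area is rows × row length × inter-row spacing and that no moisture adjustment is applied; neither assumption is stated in the study, and the 405 g plot weight is a hypothetical value.

```python
# Plot geometry from the trial description; everything else is illustrative.
ROWS_PER_PLOT = 2
ROW_LENGTH_M = 1.5
INTER_ROW_SPACING_M = 0.45


def plot_area_m2(rows=ROWS_PER_PLOT, row_length_m=ROW_LENGTH_M,
                 inter_row_m=INTER_ROW_SPACING_M):
    """Ground area of a plot, assuming each row occupies one inter-row width."""
    return rows * row_length_m * inter_row_m


def yield_kg_per_ha(plot_weight_g, area_m2):
    """Scale a whole-plot grain weight in grams to kg/ha (1 ha = 10 000 m^2)."""
    return (plot_weight_g / 1000.0) * (10_000.0 / area_m2)


area = plot_area_m2()                          # about 1.35 m^2 under the assumption above
example_yield = yield_kg_per_ha(405.0, area)   # hypothetical 405 g harvested from one plot
```

With such small plots, each gram of grain corresponds to roughly 7.4 kg/ha, so small weighing or border effects are magnified when yields are expressed per hectare.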

DNA extraction, SNP marker genotyping and data pre-processing

DNA extraction was done using magnetic bead chemistry (sbeadex mini plant kit from LGC, Biosearch Technologies, Berlin, Germany) on a KingFisher Flex. SNP marker genotyping was performed using SeqSNP, a targeted genotyping-by-sequencing service offered by LGC, which allows for genotyping of SNPs and small insertions/deletions using a single primer enrichment technology (LGC Bioscience Technologies 2019). In order to design the SeqSNP assay, a total of 500 informative markers were selected from a panel of 1 082 markers in the LGC database, which were designed from an original set of 1 536 SNP markers, the “Universal Soy Linkage Panel” (USLP 1.0) described by Hyten et al. (2010). These SNP markers were selected for their even distribution throughout each of the 20 consensus linkage groups and for optimum allele frequency in diverse germplasm. The physical start and end positions of the markers, used to construct a BED file for sequencing, were taken from the Soybase database, with Williams 82 as the reference genome.

The total number of targets that passed design was 496, covered by a total of 984 oligo probes, i.e. about 1.98 oligo probes per target. The total number of targets which passed the quality criteria, that is, those that were successfully genotyped in at least 85% of all samples, was 485 (97.8%). Sequencing was performed on a NextSeq 500 and yielded 35 397 796 pre-processed reads, which is approximately 168 561 reads per sample. The percentage of reads effectively used in genotyping was 83.4% and the average effective target SNP coverage was 283x. The SNP genotyping pipeline involved diploid genotyping with a minimum coverage of 8 reads per sample and locus using FreeBayes (Garrison & Marth 2012). A total of 437 (87.1%) of the targets were polymorphic; 98.5% of all calls were homozygous and 1.5% heterozygous. Missing data accounted for 1.4% of calls.

Demultiplexing of all library groups was done using the Illumina bcl2fastq software. One or two mismatches or Ns were allowed in the barcode read when the barcode distances between all libraries on the lane allowed for it. Sequencing adapter remnants were then clipped from all reads, and reads with a final length of < 65 bases were discarded. The adapter-clipped Illumina reads were then quality trimmed: reads containing Ns were removed, and reads were trimmed at the 3′ end to obtain a minimum average Phred quality score of 30 over a window of ten bases. Reads with a final length of < 65 bases were again discarded. FastQC reports were then created for all FASTQ files, and a summary of read counts for all samples was generated.
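The quality-trimming rule described above can be sketched as follows. This is a minimal illustration, not the service provider's pipeline: it assumes Phred+33 quality encoding, checks for Ns before trimming, and shortens the read one base at a time from the 3′ end until the last ten bases average at least Phred 30. The function names are hypothetical; the thresholds are those given in the text.

```python
MIN_LENGTH = 65     # reads shorter than this are discarded (from the text)
WINDOW = 10         # window size in bases (from the text)
MIN_MEAN_Q = 30     # minimum mean Phred score within the window (from the text)


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, quals):
    """Shorten the read from the 3' end until its last WINDOW bases average >= MIN_MEAN_Q."""
    end = len(seq)
    while end >= WINDOW:
        window = quals[end - WINDOW:end]
        if sum(window) / WINDOW >= MIN_MEAN_Q:
            break
        end -= 1
    else:
        end = 0  # no window of sufficient quality was found
    return seq[:end], quals[:end]


def clean_read(seq, quality_string):
    """Return the trimmed (sequence, qualities) pair, or None if the read is discarded."""
    if "N" in seq.upper():
        return None
    seq, quals = trim_3prime(seq, phred_scores(quality_string))
    if len(seq) < MIN_LENGTH:
        return None
    return seq, quals


# Hypothetical 80-base read whose last ten bases have very low quality.
example = clean_read("ACGT" * 20, "I" * 70 + "#" * 10)
```

Because the rule is based on a window average, a trimmed read can still end in one or two low-quality bases, as long as the final ten-base window averages at least 30.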

Data analysis

Quality-trimmed reads were aligned against the target genome using Bowtie2, followed by variant discovery and genotyping of samples with FreeBayes v1.0.2-16. Ploidy was set at 2 and genotypes were filtered for a minimum coverage of 8 reads. SNP marker diversity and profile were analyzed using the PowerMarker and GenAlEx software. SNP data quality was checked by filtering: SNPs with a call rate greater than 90% were retained and those with a minor allele frequency (MAF) of < 0.05 were discarded. The polymorphic information content (PIC), observed heterozygosity (Ho), expected heterozygosity (He), allele frequency and Shannon Information Index (I) were computed in PowerMarker (Liu and Muse 2005) and GenAlEx (Peakall and Smouse 2012).
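For readers who want to see what these marker statistics involve, the sketch below computes them for a single SNP from diploid calls and applies the two filters described above. It is a plain-Python illustration with hypothetical function names and toy data, not the PowerMarker or GenAlEx implementation; He is the uncorrected value 1 − Σp², PIC follows the usual Botstein-type formula, and the Shannon index uses natural logarithms.

```python
import math
from collections import Counter


def marker_stats(calls):
    """Summarise one SNP; `calls` holds (allele, allele) tuples, or None when missing."""
    typed = [c for c in calls if c is not None]
    if not typed:
        return {"call_rate": 0.0, "maf": 0.0, "he": 0.0, "ho": 0.0, "pic": 0.0, "shannon": 0.0}
    counts = Counter(allele for call in typed for allele in call)
    total = sum(counts.values())
    freqs = sorted((n / total for n in counts.values()), reverse=True)
    he = 1.0 - sum(p * p for p in freqs)                    # expected heterozygosity
    ho = sum(1 for a, b in typed if a != b) / len(typed)    # observed heterozygosity
    cross = sum(2 * freqs[i] ** 2 * freqs[j] ** 2
                for i in range(len(freqs)) for j in range(i + 1, len(freqs)))
    return {
        "call_rate": len(typed) / len(calls),
        "maf": freqs[1] if len(freqs) > 1 else 0.0,         # second most common allele
        "he": he,
        "ho": ho,
        "pic": he - cross,
        "shannon": -sum(p * math.log(p) for p in freqs),
    }


def passes_filters(stats, min_call_rate=0.90, min_maf=0.05):
    """Keep SNPs with call rate > 90% and MAF >= 0.05, as described in the text."""
    return stats["call_rate"] > min_call_rate and stats["maf"] >= min_maf


# Toy data: ten fully typed samples, one of them heterozygous.
toy_calls = [("A", "A")] * 5 + [("G", "G")] * 4 + [("A", "G")]
toy_stats = marker_stats(toy_calls)
monomorphic = marker_stats([("C", "C")] * 10)
```

In the toy marker, He is close to 0.5 while Ho is only 0.1, the same kind of gap between expected and observed heterozygosity that is typical of a self-pollinated crop.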

Genetic diversity analyses were conducted in R version 3.5.1 (R Core Team 2015). The genotypes were subjected to silhouette plot analysis to determine the probable number of clusters. Coefficients of similarity showing genetic distances among the soybean lines (matrix of similarities) were calculated following Gower’s distance model (Gower 1971). The similarity matrix was then used to group the soybean genotypes using the Unweighted Pair Group Method using Arithmetic average (UPGMA) algorithm in R, giving an annotated phylogenetic tree (Rambaut 2016). The 30 tropical and 180 temperate genotypes were also analyzed separately, and a dendrogram was drawn in R for each group of genotypes.
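The two steps described above, a Gower-type distance followed by UPGMA clustering, can be sketched in plain Python. This is an illustration only and does not reproduce the R workflow used in the study: for purely categorical genotype codes, Gower's dissimilarity reduces to the proportion of compared loci at which two lines differ, with missing values skipped. The line names and genotype codes below are hypothetical.

```python
def gower_distance(a, b):
    """Gower dissimilarity for categorical genotype codes; None marks missing data."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        raise ValueError("no loci typed in both lines")
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(names, dist):
    """Average-linkage clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = {i: (names[i],) for i in range(len(names))}
    d = {(i, j): dist[i][j] for i in range(len(names)) for j in range(i + 1, len(names))}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        (i, j), distance = min(d.items(), key=lambda kv: kv[1])
        size_i, size_j = len(clusters[i]), len(clusters[j])
        new_d = {}
        for k in clusters:
            if k in (i, j):
                continue
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            # Size-weighted mean keeps every original line equally weighted (UPGMA).
            new_d[(k, next_id)] = (size_i * dik + size_j * djk) / (size_i + size_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        merges.append((clusters[i], clusters[j], distance))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


# Hypothetical lines scored at six loci (bases coded 1-4, None = missing).
genotypes = {
    "line_A": [1, 1, 3, 4, 2, 1],
    "line_B": [1, 1, 3, 4, 2, 3],
    "line_C": [2, 3, 3, 1, 2, 1],
    "line_D": [2, 3, 1, 1, None, 1],
}
names = list(genotypes)
matrix = [[gower_distance(genotypes[a], genotypes[b]) for b in names] for a in names]
merges = upgma(names, matrix)
```

Each entry in `merges` records which two clusters were joined and at what average distance, which is the information a dendrogram displays.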

Population structure analysis was performed using the Bayesian clustering approach in STRUCTURE v2.3.4 (Porras-Hurtado et al. 2012). The analysis was run using an admixture model with a burn-in period of 5 000 and 50 000 Markov chain Monte Carlo replications. The number of clusters (K) was set to range from 1 to 10, with 3 iterations each. The output from STRUCTURE was then imported into STRUCTURE Harvester (Earl and VonHoldt 2012) to visualize the delta K value, which forms a distinct peak, using the Evanno method. Analysis of molecular variance (AMOVA) was done using GenAlEx (Peakall and Smouse 2012) to determine the variance components and the molecular diversity between and within populations. Bases were coded A = 1, C = 2, G = 3, T = 4 and missing data 0. Clone identification was also done in GenAlEx. Nei’s nucleotide distance and the fixation index (FST) were also computed. The fixation index is a measure of the genetic variation that can be explained by population structure and ranges from 0 (identical) to 1 (completely different, with no alleles shared) (Mohammadi and Prasanna 2003). It is calculated as:

$$F_{ST} = \frac{\delta_{s}^{2}}{\bar{p}\,(1-\bar{p})}$$

where \(\delta_{s}^{2}\) is the variance in the frequency of the allele among subpopulations, weighted by the sizes of the subpopulations, and \(\bar{p}\) is the average frequency of the allele in the total population.
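A small worked example may make the formula concrete. The sketch below evaluates it for one biallelic locus, taking the size-weighted variance of the subpopulation allele frequencies as the numerator; the function name and the allele frequencies are hypothetical and are not taken from the study's data.

```python
def fst_single_locus(freqs, sizes):
    """F_ST for one biallelic locus from subpopulation allele frequencies and sizes."""
    n = sum(sizes)
    p_bar = sum(p * s for p, s in zip(freqs, sizes)) / n                 # weighted mean frequency
    var_s = sum(s * (p - p_bar) ** 2 for p, s in zip(freqs, sizes)) / n  # weighted variance
    if p_bar <= 0.0 or p_bar >= 1.0:
        return 0.0  # locus is monomorphic overall, so there is nothing to partition
    return var_s / (p_bar * (1.0 - p_bar))


# Two equally sized subpopulations with similar allele frequencies.
similar = fst_single_locus([0.4, 0.6], [50, 50])
# Two subpopulations fixed for different alleles.
fixed = fst_single_locus([0.0, 1.0], [50, 50])
```

The first case gives a value of 0.04, of the same order as the low FST reported in this study, while the second gives 1, the upper limit reached when subpopulations share no alleles.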


Results

Phenotypic yield data

The yield data showed that the tropical lines yielded more than the temperate genotypes in Zimbabwe. The top ten performing genotypes were all tropical, while the bottom ten were all temperate (Table 1). The frequency distribution of genotype yields is shown in Fig. 1. Only 15 genotypes yielded above 4000 kg/ha, and these were mainly of tropical origin. Of the 49 genotypes which yielded between 3000 and 4000 kg/ha, 46 were of temperate origin. The largest group of genotypes (70) fell in the yield range of 2000–3000 kg/ha, while no genotype yielded below 1000 kg/ha (Fig. 1).

Table 1 Top ten and bottom ten yield data for the soybean genotypes evaluated in Zimbabwe
Fig. 1 Frequency distribution of 200 non-genetically modified soybean genotypes for grain yield

SNP marker diversity and profile

After filtering, 403 SNP markers remained with minor allele frequency  > 0.05. The SNP marker profiles are presented in Table 2. The average minor allele frequency was 0.24. The number of alleles ranged from 1 to 3 with an average of 1.88. The Shannon Information index ranged from 0.03 to 0.98 with a mean of 0.45. The mean expected heterozygosity (He) was 0.31, whilst the mean observed heterozygosity was 0.02. The mean polymorphic information content (PIC) was 0.24.

Table 2 SNP marker diversity for genotyping 210 diverse temperate and tropical soybean lines

Population structure

The silhouette plots showed that considering two clusters produced one genotype with a negative silhouette value (Fig. 2a). When three clusters were considered, all the genotypes fitted into the three clusters with no negative values (Fig. 2b). Considering more than three clusters produced several genotypes with negative silhouette values. Therefore, three clusters were the best fit for all genotypes. The first cluster contained 205 individuals, whilst clusters two and three had three and two lines, respectively. The average genetic distances (GD) were 0.28, 0.11 and 0.13 for clusters 1, 2 and 3, respectively.

Fig. 2 Silhouette plots showing the number of possible clusters formed from 210 genotyped soybean lines: a considering 2 clusters; b considering 3 clusters

According to Gower’s genetic distances calculated in R, all 210 genotypes were also grouped into three clusters, as shown in the phylogenetic tree drawn using UPGMA cluster analysis (Fig. 3). The first cluster consisted of three temperate genotypes, Nitchuu 47, Tara and Tousan, while the second cluster consisted of two lines, namely Forrest and Fowler. The five genotypes in clusters one and two are all from the USA. The third cluster consisted of 205 genotypes, comprising all the tropical genotypes from Zimbabwe, South Africa and Malawi and several temperate genotypes from the USA. Some pairs of genotypes were separated by short genetic distances (Fig. 3), such as Pudou 426 and Usada Zairai (0.02), Yougestu and Tachiyukata (0.02), UI. San and IC. San (0.05), Saga and Santee (0.07), and Stanza and Mwenezi (0.08). Most of the lines from Zimbabwe fitted in the third cluster. Three of the South African genotypes clustered together. Several USA genotypes also clustered close to each other.

Fig. 3 UPGMA phylogenetic tree showing three clusters for all the 210 soybean lines drawn using the Gower’s similarity distances

When only the tropical lines were analysed, three clusters were formed: all the Zimbabwean lines clustered together in the first cluster, while all the South African lines clustered together in the second cluster (Fig. 4). The third cluster had Tikolore, the only line from Malawi. Sister lines clustered close to each other, for example S1440-5-1E and S1440-5-2E, as well as LDC-5-3 and LDC-5-9. The shortest genetic distances were between Stanza and Mwenezi (0.08) and between Solitaire and Pan 1867 (0.09). The greatest genetic distances were observed between Tikolore and Stanza (0.24), Tikolore and Mwenezi (0.17) and Tikolore and Serenade (0.12).

Fig. 4 Dendrogram showing clustering of the 30 tropical soybean lines

A UPGMA phylogenetic tree for the temperate genotypes only is shown in Fig. 5. This tree shows three clusters for these lines, and the lines that clustered close together when all 210 lines (temperate and tropical) were included still clustered close to each other when only the temperate lines were analysed. Most of the LD lines clustered together, just as they did when the temperate and tropical lines were analysed together. Moreover, lines such as Benning and Bingnan, Yougestu and Tachiyutaka, and IC-San and UI-San clustered close to each other, with short genetic distances of 0.08, 0.02 and 0.05, respectively.

Fig. 5 UPGMA phylogenetic tree showing clusters of the 180 temperate soybean lines only

The Evanno method was used in STRUCTURE Harvester to reveal the optimum K value for the genotyped soybean lines. The delta K (∆K) curve peaked at K = 3, with a mean ln likelihood of -46516.5 and a variance of ln likelihood of 3407.0, meaning that a total of three clusters or subpopulations contributed to the total variation in the soybean lines under study (Fig. 6).
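The ∆K statistic behind this peak can be illustrated with a short sketch. It assumes the commonly used form in which ∆K is the absolute second-order difference of the mean ln likelihood across successive K values, divided by the standard deviation of the ln likelihood among replicate runs at that K. The ln likelihood values below are hypothetical (three replicate runs per K, as in this study's design) and are not the study's results.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp):
    """Compute delta K; `lnp` maps each K to the ln likelihoods of its replicate runs."""
    ks = sorted(lnp)
    m = {k: mean(lnp[k]) for k in ks}
    delta = {}
    for k in ks[1:-1]:
        if k - 1 not in m or k + 1 not in m:
            continue  # delta K needs both neighbouring K values
        second_diff = abs(m[k + 1] - 2 * m[k] + m[k - 1])
        sd = stdev(lnp[k])
        if sd > 0:
            delta[k] = second_diff / sd
    return delta


# Hypothetical replicate ln likelihoods for K = 1..5.
runs = {
    1: [-100.0, -102.0, -98.0],
    2: [-80.0, -81.0, -79.0],
    3: [-50.0, -51.0, -49.0],
    4: [-48.0, -52.0, -44.0],
    5: [-47.0, -55.0, -40.0],
}
delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
```

Note that ∆K cannot be computed for the smallest or largest K tested, so a run over K = 1 to 10 can only identify an optimum between 2 and 9.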

A population structure plot was constructed to reveal the architecture within the population. In agreement with the Evanno method, three subpopulations were recognised (Fig. 7). Each of the colors (red, green and blue) in the population structure plot represents one cluster. The lines Fowler and Forrest (188 and 180, respectively) clustered close to each other, and also close to Tousan (102), Tara (147) and Nitchuu 47, which were in a different cluster according to the UPGMA tree. Several other genotypes had genomes derived from at least two of the subpopulations (Fig. 7).

Fig. 6 Graph showing the best k value using the Evanno method


Clone analysis was done in GenAlEx to identify duplicates, and the results are shown in Table 3. Two groups of duplicates were identified: Pudou-426 and Usada-Zairai (group A), and Tachiyukata and Yougestu (group B).
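The idea behind this kind of duplicate search can be sketched as grouping lines that share an identical multilocus genotype. The example below is a simplified illustration with hypothetical line names and genotype codes; it requires an exact match at every locus and treats the missing-data code 0 as a state of its own, which may differ from the matching options available in GenAlEx.

```python
from collections import defaultdict


def find_duplicates(genotypes):
    """Group line names that share exactly the same multilocus genotype profile."""
    groups = defaultdict(list)
    for name, profile in genotypes.items():
        groups[tuple(profile)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical profiles (bases coded 1-4, 0 = missing).
profiles = {
    "entry_1": [1, 3, 3, 4, 2],
    "entry_2": [1, 3, 3, 4, 2],
    "entry_3": [1, 3, 1, 4, 2],
    "entry_4": [2, 3, 1, 4, 0],
}
duplicate_groups = find_duplicates(profiles)
```

Exact matching is strict: a single genotyping error or missing call is enough to hide a true duplicate, so pairs with very small genetic distances are also worth inspecting.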

Table 3 Duplications of the soybean lines derived from clone analysis
Fig. 7 Population structure of the 210 soybean lines

Genetic diversity among soybean lines

Analysis of molecular variance (AMOVA) was performed using GenAlEx for the three subpopulations identified in STRUCTURE. The AMOVA showed that the total variation can be partitioned into among-population and within-population sources, accounting for 4% and 96% of the total variation, respectively (Table 4). The FST value of 0.06 was low.

Table 4 Analysis of molecular variance (AMOVA) for the 210 soybean lines

Table 5 shows the genetic variability among and within populations and the fixation index (FST) for the soybean lines. Nei’s net nucleotide distance ranged from 0.06 between cluster 1 and cluster 2 to 0.12 between cluster 2 and cluster 3. Cluster 1 and cluster 3 had a nucleotide distance of 0.09. This means that clusters 2 and 3 were furthest apart, whereas clusters 1 and 2 were closest to each other. The lowest within-population variation was recorded in cluster 3, with an expected heterozygosity (He) of 0.21, whilst cluster 2 had the highest within-population variation, with an He of 0.31. The fixation index (FST) values were 0.06 (cluster 1), 0.29 (cluster 2) and 0.02 (cluster 3). Cluster 3 therefore had the lowest FST, at 0.02 (Table 5).

Table 5 Allele-frequency divergence among populations (Nei’s Net nucleotide distance) and within populations (expected heterozygosity) and Fixation Index (FST) for 210 soybean lines


Discussion

Phenotypic yield data

The results showed that the tropical lines yielded more than the temperate lines, which indicates that the tropical lines are well adapted to the Zimbabwean environment. This is expected, especially when lines are introduced from a region with different environmental conditions in terms of rainfall, latitude, altitude and temperature. Although the temperate genotypes yielded less than the tropical ones, 46 temperate genotypes yielded relatively well, above 3000 kg/ha, indicating their potential utility for tropical and subtropical breeding programs. These accessions can be utilized in soybean breeding programs for introgression of important traits, such as rust resistance and maturity date, provided they are screened for such traits, as this would reduce linkage drag effects on productivity (Abebe et al. 2021).

SNP marker diversity and profile

The SNPs used were informative and suitable for differentiating the soybean genotypes under study. The allelic number, which ranged from 1 to 3, can be attributed to the crop being self-pollinated, and is consistent with previous reports of low allelic diversity and heterozygosity levels in soybean (Abebe et al. 2021; Wright 1921). The MAF measures the ability of markers to discriminate genotypes. Because SNP markers are bi-allelic, a MAF above 0 is considered informative or discriminating, so the mean MAF of 0.24 indicates that the SNPs were informative. In the present study, 60% of the markers had a MAF between 0.3 and 0.5, which is comparable to values reported for soybean in previous studies (Chander et al. 2021; Abebe et al. 2021). The mean PIC value of 0.24 also indicates that the markers were informative. Considering the bi-allelic nature of SNPs, for which the PIC cannot exceed 0.5 (Singh et al. 2013), the PIC values obtained in this study were adequate for differentiating the 210 soybean genotypes. Similar results were reported in soybean by Abebe et al. (2021), who found a mean PIC value of 0.25 among elite lines developed by the IITA. In rice, another self-pollinated crop, Singh et al. (2013) reported a mean PIC value of 0.23. The observed heterozygosity (Ho) of 0.02 was much lower than the expected heterozygosity (He) of 0.31 in this study. This implies a high level of inbreeding and fixation at most of the loci (Nawaz et al. 2021). Overall, the SNPs used in this study were informative and discriminating, hence they can be recommended for diversity studies in other soybean populations.

Population structure and genetic diversity

The study was effective in determining the population structure and level of diversity in the germplasm collection. The outcomes from the silhouette plots, the UPGMA tree and the Evanno method in STRUCTURE, which were used to discriminate the 210 soybean genotypes into clusters based on genetic similarity, were consistent. The silhouette plots grouped the genotypes into three clusters with no negative silhouette values, indicating that this was the effective number of clusters that could be formed from the germplasm used in this study. Silhouette plots are generally used to visualize how well data points belong to their assigned clusters. The silhouette scores, which range from -1 to 1, measure how similar an object is to its own cluster compared to other clusters (Menardi 2011; Pant et al. 2008; Rousseeuw 1987; Thinsungnoen et al. 2015). This finding was confirmed by the two additional tools used in the study.
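The silhouette score described above can be written out in a few lines. For each object, a is its mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b − a) / max(a, b). The sketch below is an illustration with a hypothetical four-line distance matrix; it assumes at least two clusters and assigns a score of 0 to singleton clusters, a common convention.

```python
def silhouette_values(dist, labels):
    """Silhouette score of every object, given a distance matrix and cluster labels."""
    n = len(labels)
    values = []
    for i in range(n):
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:
            values.append(0.0)  # singleton cluster
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == other) / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        values.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return values


# Hypothetical distances among four lines: the first two and the last two are close.
toy_dist = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.8, 0.2, 0.0],
]
good = silhouette_values(toy_dist, [0, 0, 1, 1])   # natural grouping
poor = silhouette_values(toy_dist, [0, 1, 0, 1])   # deliberately mismatched grouping
```

With the natural grouping every score is positive, whereas the mismatched grouping produces negative scores, which is the signal used above to reject cluster numbers other than three.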

The Unweighted Pair Group Method using Arithmetic average (UPGMA) produced a phylogenetic tree with three populations which corroborated the findings from the silhouette plots and the Evanno method. While five genotypes from the USA (Nitchuu 47, Tara, Tousan, Forrest and Fowler) were grouped in clusters one and two, all other genotypes were grouped in the third cluster. The genotypes included in the third cluster were from different sources, from the USA, Zimbabwe, South Africa and Malawi. This means that there was limited molecular variation among the genotypes used in this study. This could be attributed to exchange of genetic material across the different breeding programs in the Southern Africa region and external sources from other regions, such as Asia and America. An analysis of seed shipments indicates that there is a lot of germplasm exchange between the soybean breeding programs in Southern Africa and the USA. This implies that the soybean lines were derived from shared backgrounds and were selected for the same market requirements leading to utilization of the same set of alleles. According to the literature and actual pedigree analysis of this germplasm set, most soybean lines were developed from a narrow genetic base derived from a few ancestral lines. A survey of the literature indicates extensive utilization of external germplasm from different countries, such as China, Japan and Korea (Abebe et al. 2021; Bruce et al. 2019; Jeong et al. 2019a, b; Kim et al. 2014). It is a standard and recommended industry practice for breeders to continuously incorporate and integrate external germplasm in their breeding programs.

According to the phylogenetic tree of the 210 genotypes and a separate analysis of the tropical lines only, Zimbabwean and South African lines are clustered together separately. These lines were bred to satisfy the same market requirements with common trait preferences and common allelic constitutions. Several other genotypes clustered close to each other in accordance with their origin, adding credence to the possibility of utilizing common genetic background in breeding programs. Similar results of soybean genotypes that were clustered in accordance with the place of origin have been reported (Lee et al. 2014; Liu et al. 2017). This has also been reported for other legume crops, such as cowpea (Fatokun et al. 2018; Sodedji et al. 2021) and sesame (Basak et al. 2019). In the analysis involving tropical lines only, Tikolore was classified alone in its own cluster showing its potential for use in the tropical breeding programs for introgression of important traits.

Duplications indicate a high level of genetic similarity (Makore et al. 2021). The duplicates revealed in this study are consistent with the phylogenetic tree, which shows low genetic distances between some lines. The duplications and minimal genetic distances suggest that some introductions were given different names by different breeders.

The results from the analysis of molecular variance (AMOVA) support the possibility of high gene flow, as variation among populations accounted for just 4% of the total variation, whilst variation within populations accounted for about 96%. The FST value of 0.06 indicated low genetic differentiation among populations, suggesting high gene exchange. This observation is consistent with the literature. Wang et al. (2012) reported that most populations exhibited the effects of genetic bottlenecks. Basak et al. (2019) also reported similar results in sesame. Abebe et al. (2021) reported moderate genetic variation in soybean, with 11% of the total variation attributed to differences among clusters, 71% attributed to individual genotypes, and an FST value of 0.11. Generally, FST values close to 0 indicate that subpopulations share almost all alleles, that is, there is little divergence among them, whilst an FST value of 1 means that the subpopulations are fixed for different alleles (Basak et al. 2019; Mohammadi and Prasanna 2003). In the current study, the low FST value implies that little improvement can be achieved through simple hybridization for some traits of economic importance, for example yield. However, the low diversity can be exploited to conserve such important traits by crossing related genotypes, for example by crossing genotypes within cluster 3 to maintain high yields while taking advantage of rare or minor alleles found in other genotypes. Minor alleles that could be leveraged in such germplasm include those for earliness, found in most USA genotypes. Genotypes from clusters 2 and 3 can be hybridized to develop improved varieties, although the improvement has a ceiling because of the low genetic variation within the germplasm used in this study.

Conclusions and recommendations

The SNP markers used were informative and displayed high discrimination capacity; hence, the results of this study were useful for the molecular characterization of this soybean collection in Southern Africa. The 210 germplasm lines were consistently grouped into three clusters by three tools. Low molecular diversity was evident. These findings have serious implications for breeding programs that aim to improve soybean varieties using this germplasm collection. Innovation strategies for improving variability in the germplasm collection, such as investment in pre-breeding, increasing the geographic sources of introductions and exploitation of mutation breeding, are recommended to enhance genetic gain.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.



The authors would like to acknowledge DAAD for funding the research and Seed Co for the provision of the experimental stations for this study.


The research was funded by the German Academic Exchange Service (DAAD) as part of the PhD funding.

Author information

Authors and Affiliations



AT: conceptualization of the research, field work, data analysis, writing of the original draft, reviewing and editing of the final manuscript; EG: data analysis, reviewing and editing; HM: reviewing and editing; JFYE: supervision, reviewing and editing; PT: supervision, reviewing and editing; EYD: supervision and reviewing; LM: reviewing and editing; MZ: selection of SNP markers, reviewing and editing; EZ: reviewing and editing; JD: supervision, reviewing and editing. All authors read and approved the final manuscript.

Corresponding author

Correspondence to J. Derera.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Tsindi, A., Eleblu, J.S.Y., Gasura, E. et al. Analysis of population structure and genetic diversity in a Southern African soybean collection based on single nucleotide polymorphism markers. CABI Agric Biosci 4, 15 (2023).


  • Glycine max
  • Molecular diversity
  • Phenotyping
  • Population structure
  • SNP
  • Soybean