

The 1000 Genomes Project has already elucidated the properties and distribution of common and rare variation, provided insights into the processes that shape genetic diversity, and advanced understanding of disease biology 1, 2. This resource provides a benchmark for surveys of human genetic variation and constitutes a key component for human genetic studies, by enabling array design 3, 4, genotype imputation 5, cataloguing of variants in regions of interest, and filtering of likely neutral variants 6, 7.

In this final phase, individuals were sampled from 26 populations in Africa (AFR), East Asia (EAS), Europe (EUR), South Asia (SAS), and the Americas (AMR) (Fig. 1a; see Supplementary Table 1 for population descriptions and abbreviations). All individuals were sequenced using both whole-genome sequencing (mean depth = 7.4×) and targeted exome sequencing (mean depth = 65.7×). In addition, individuals and available first-degree relatives (generally, adult offspring) were genotyped using high-density SNP microarrays. This provided a cost-effective means to discover genetic variants and estimate individual genotypes and haplotypes 1, 2.

In contrast to earlier phases of the project, we expanded analysis beyond bi-allelic events to include multi-allelic SNPs, indels, and a diverse set of structural variants (SVs). An overview of the sample collection, data generation, data processing, and analysis is given in Extended Data Fig. Variant discovery used an ensemble of 24 sequence analysis tools (Supplementary Table 2), and machine-learning classifiers to separate high-quality variants from potential false positives, balancing sensitivity and specificity. Construction of haplotypes started with estimation of long-range phased haplotypes using array genotypes for project participants and, where available, their first-degree relatives; continued with the addition of high-confidence bi-allelic variants that were analysed jointly to improve these haplotypes; and concluded with the placement of multi-allelic and structural variants onto the haplotype scaffold one at a time (Box 1). Overall, we discovered, genotyped, and phased 88 million variant sites (Supplementary Table 3). The project has now contributed or validated 80 million of the 100 million variants in the public dbSNP catalogue (version 141 includes 40 million SNPs and indels newly contributed by this analysis). These novel variants especially enhance our catalogue of genetic variation within South Asian (which account for 24% of novel variants) and African populations (28% of novel variants).

To control the false discovery rate (FDR) of SNPs and indels at [...] 30×) PCR-free sequence data generated for one individual per population. For structural variants, additional orthogonal methods were used for confirmation, including microarrays and long-read sequencing, resulting in FDR [...] 95% and >80%, respectively, for variants with sample frequency of at least 0.5%, rising to >99% and >85% for frequencies >1% (Extended Data Fig.). At lower frequencies, comparison with >60,000 European haplotypes from the Haplotype Reference Consortium 9 suggests 75% power to detect SNPs with frequency of 0.1%. Furthermore, we estimate heterozygous genotype accuracy at 99.4% for SNPs and 99.0% for indels (Supplementary Table 4), a threefold reduction in error rates compared to our previous release 2, resulting from the larger sample size, improvements in sequence data accuracy, and genotype calling and phasing algorithms.

To construct high-quality haplotypes that integrate multiple variant types, we adopted a staged approach 37.

(1) A high-quality ‘haplotype scaffold’ was constructed using statistical methods applied to SNP microarray genotypes (black circles) and, where available, genotypes for first-degree relatives (available for ∼52% of samples; Supplementary Table 11) 38.
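
The idea of controlling the FDR of SNP and indel calls through a cut-off on a variant quality score, as mentioned in this section, can be illustrated with a minimal sketch. This is not the project's actual procedure. It assumes a hypothetical validation set in which each call has a quality score and a true/false label (for example, from high-depth data on the same individual). The function name `pick_score_threshold` and its parameters are illustrative assumptions, and the target FDR is left as an argument because no value is assumed here.

```python
from typing import Optional, Sequence


def pick_score_threshold(scores: Sequence[float],
                         validated: Sequence[bool],
                         target_fdr: float) -> Optional[float]:
    """Return the most permissive score cut-off whose validation-set FDR
    does not exceed target_fdr, or None if no cut-off qualifies.

    Hypothetical helper: calls with score >= the returned value are kept.
    """
    if len(scores) != len(validated):
        raise ValueError("scores and validated must have equal length")

    # Rank calls from most to least confident.
    ranked = sorted(zip(scores, validated), key=lambda p: p[0], reverse=True)

    best = None
    false_calls = 0
    for i, (score, ok) in enumerate(ranked, start=1):
        if not ok:
            false_calls += 1
        # Evaluate only at the last call of a tied score group, so that
        # "keep everything with score >= threshold" is well defined.
        is_group_end = i == len(ranked) or ranked[i][0] != score
        if is_group_end and false_calls / i <= target_fdr:
            best = score
    return best


# Toy validation set (made-up numbers, for illustration only).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [True, True, False, True, False]
toy_threshold = pick_score_threshold(toy_scores, toy_labels, target_fdr=0.25)
```

Because the empirical FDR is not monotonic in the cut-off, the sketch scans every cut point and keeps the lowest score that still meets the target, rather than stopping at the first violation.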
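
The heterozygous genotype accuracies quoted in this section can be restated as error rates with simple arithmetic. The short sketch below does that conversion. The final two values are only what a uniform threefold reduction would imply; they are an assumption for illustration, not figures reported in the text.

```python
def error_rate_pct(accuracy_pct: float) -> float:
    """Convert a genotype accuracy (in percent) to an error rate (in percent)."""
    return round(100.0 - accuracy_pct, 1)


snp_error = error_rate_pct(99.4)    # heterozygous SNP genotypes
indel_error = error_rate_pct(99.0)  # heterozygous indel genotypes

# Assumption: "threefold" is applied uniformly to both variant classes.
# The text does not state the earlier per-class error rates.
fold_reduction = 3
implied_previous_snp_error = round(snp_error * fold_reduction, 1)
implied_previous_indel_error = round(indel_error * fold_reduction, 1)
```

Under these assumptions, the current heterozygous error rates are 0.6% for SNPs and 1.0% for indels.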
