1,000 Genomes Project (2008–2015)
The 1,000 Genomes Project, which began in 2008, was an international effort to create a detailed and publicly accessible catalog of human genetic variation to support medical studies aimed at exploring genetic contributions to disease. Project scientists sequenced the entire genomes of 2,504 individuals from around the world—more than the 1,000 originally planned. The Project extended the results of the International HapMap Project, a prior effort at cataloging human genetic variation that ran from 2002 through 2010. Whereas the HapMap identified common genetic variants, meaning specific DNA sequences present in five percent or more of individuals in a population, the 1,000 Genomes Project identified genetic variants present in as few as one percent of individuals in a population. By assembling a larger catalog of DNA sequence variation than had previously existed, the 1,000 Genomes Project paved the way for researchers to more precisely locate disease-related genetic variation passed from parent to child.
- Background and Context
- Planning the 1,000 Genomes Project
- Phases of the 1,000 Genomes Project
- Results of the Project
- Legacy and Impact
Background and Context
The 1,000 Genomes Project emerged out of scientific efforts, in the wake of the completion of the Human Genome Project, to understand the genetic contributions to disease. The Human Genome Project, which began in 1990 and concluded in 2003, was an international effort to sequence, for the first time, all three billion base pairs of DNA in one human genome, as well as the genomes of several model organisms such as the fruit fly, nematode worm, and mouse. Scientists published the human genome sequence, called the reference genome, in 2003. Generated on the basis of DNA samples that scientists collected from a small number of humans, the reference genome provided researchers with a map of where genes are located, an estimate of how many genes there are in total, and a baseline against which to compare other human genome sequences. Over the course of the Human Genome Project, scientists learned that one human genome differs from another, on average, by one out of every thousand base pairs. That means that any two human genomes are approximately 99.9 percent identical. Because understanding disease risks requires understanding not only how humans are similar but how they are different, researchers increasingly became interested in exploring the 0.1 percent of DNA that differs from one person to the next. The 1,000 Genomes Project was one in a series of efforts that scientists undertook to catalog that genetic variation.
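The arithmetic behind those figures can be checked with a short Python sketch. It assumes the round numbers given above, a genome of three billion base pairs and an average difference of one base pair in every thousand, and the variable names are illustrative only.

```python
# Round figures from the text: about three billion base pairs per genome,
# and an average difference of one base pair in every thousand.
GENOME_BASE_PAIRS = 3_000_000_000
BASE_PAIRS_PER_DIFFERENCE = 1_000

# Expected number of positions at which two genomes differ.
differing_base_pairs = GENOME_BASE_PAIRS // BASE_PAIRS_PER_DIFFERENCE

# The same rate expressed as percentages.
percent_different = 100 / BASE_PAIRS_PER_DIFFERENCE
percent_identical = 100 - percent_different

print(f"Differing base pairs: {differing_base_pairs:,}")
print(f"Percent different: {percent_different:.1f}")
print(f"Percent identical: {percent_identical:.1f}")
```

Under those assumptions, two genomes differ at roughly three million positions, which is the portion of the genome that variation catalogs aim to describe.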
The International HapMap Project, which ran from 2002 to 2010, was one of the predecessors of the 1,000 Genomes Project. The goal of the HapMap was to assemble a map of common patterns of human genetic variation to enable future medical research, in particular, genome-wide association studies, or GWAS. GWAS are a type of analysis in which scientists compare the DNA sequences of two groups of people, one with and one without a particular trait, such as a disease, and look for genetic variations that correlate with the trait. Finding any genetic variations that are more common in the disease group compared to the healthy group suggests to researchers that those variations may be near, or may themselves be, DNA sequences that influence the disease condition.
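The case-versus-control comparison at the heart of a GWAS can be illustrated for a single variant. The following is a minimal sketch, not the method of any particular study: it assumes that allele counts have already been tallied in each group, the counts shown are invented for illustration, and the function name `allele_association` is hypothetical. It computes an odds ratio and a Pearson chi-square statistic for the resulting two-by-two table.

```python
def allele_association(case_alt, case_ref, control_alt, control_ref):
    """Return (odds_ratio, chi_square) for a 2x2 table of allele counts.

    Rows are the disease (case) and healthy (control) groups; columns are
    counts of the variant (alt) allele and the reference (ref) allele.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + control_alt, case_ref + control_ref]

    # Pearson chi-square: sum of (observed - expected)^2 / expected.
    chi_square = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi_square += (table[i][j] - expected) ** 2 / expected

    # Odds of carrying the variant allele in cases relative to controls.
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    return odds_ratio, chi_square


# Invented counts: the variant allele is more frequent in the case group.
odds_ratio, chi_square = allele_association(
    case_alt=300, case_ref=700, control_alt=200, control_ref=800
)
print(f"odds ratio = {odds_ratio:.2f}, chi-square = {chi_square:.2f}")
```

A real GWAS repeats a test of that kind across hundreds of thousands or millions of variants, so it must also correct for multiple testing and for differences in ancestry between the groups, which this sketch omits.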
At the time of the HapMap’s launch in 2002, sequencing an individual’s entire genome was prohibitively expensive and time-consuming, and therefore doing so for many individuals to uncover genetic variation was not feasible. It took scientists of the Human Genome Project thirteen years and 2.7 billion dollars to sequence the equivalent of one human genome. Consequently, HapMap scientists sought to provide researchers with what they called a shortcut to identifying genetic variation. That shortcut involved identifying blocks of DNA sequence variations that are inherited together, called haplotypes. For many regions of a chromosome, only a handful of haplotypes exist in humans. Using the HapMap, researchers performing a GWAS can therefore search for correlations between those particular haplotypes and the disease condition they are interested in. Finding an association of that kind tells the researcher which haplotypes to search more closely for particular DNA sequences that may be linked to the disease. By using the HapMap as a starting point, researchers reduce the number of parts of the genome that they have to sequence in detail to look for disease associations.
A premise of the HapMap was the common disease–common variant hypothesis, which is the idea that common diseases like cancer and cardiovascular disease are influenced by many genetic variants that are relatively common in a population. By common, researchers typically mean variants present in at least five percent of individuals in a population. When scientists eventually used the HapMap to perform GWAS, however, they found that they could draw only limited conclusions about the disease-linked variants they discovered because of a high possibility of false positives, or incorrect associations between disease and genetic contributions. Those negative findings led some scientists to conclude that the common disease–common variant hypothesis was not the sole explanation of disease risk. They surmised that some of the genetic variants promoting disease must be rare in the population, that is, below a frequency of five percent. The rationale of the 1,000 Genomes Project was to map those rarer variants, and thereby to improve the power of GWAS to uncover the genetic contributions to disease. The variants that the 1,000 Genomes Project scientists focused on were single nucleotide polymorphisms, or SNPs, as well as other changes like insertions and deletions. A SNP is a change in one nucleotide base, or letter, at a particular site. Insertions and deletions occur when small bits of DNA are either added or removed.
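The variant types and frequency classes described above can be expressed as two small Python functions. This is a simplified sketch: it assumes each variant is written as a reference allele and an alternate allele, in the style of common variant file formats, and it uses the five percent threshold mentioned above. The function names and example alleles are hypothetical.

```python
def variant_type(ref, alt):
    """Classify a variant by comparing reference and alternate alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"        # one base replaced by another
    if len(alt) > len(ref):
        return "insertion"  # bases added relative to the reference
    if len(alt) < len(ref):
        return "deletion"   # bases removed relative to the reference
    return "other"          # e.g. a multi-base substitution


def frequency_class(frequency, common_threshold=0.05):
    """Label a variant common at or above five percent, rare below it."""
    return "common" if frequency >= common_threshold else "rare"


# Invented examples: (reference allele, alternate allele, frequency).
examples = [("A", "G", 0.20), ("A", "ATC", 0.01), ("GCT", "G", 0.003)]
for ref, alt, freq in examples:
    print(ref, alt, variant_type(ref, alt), frequency_class(freq))
```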
Planning the 1,000 Genomes Project
Initial discussions of what became the 1,000 Genomes Project began at a May 2007 meeting of the International Human Genome Sequencing Consortium at the Cold Spring Harbor Laboratory in Cold Spring Harbor, New York. The Consortium was made up of scientists from the twenty institutions located in France, Germany, Japan, China, Great Britain, and the United States who contributed to the Human Genome Project. Richard Durbin, a scientist with the Sanger Institute in Cambridge, England, proposed that members of the Consortium plan a project to develop a comprehensive catalog of sequence variants in multiple human populations that would surpass previous efforts, including the HapMap. Because of rapid progress in the ability to sequence entire genomes quickly and cheaply, he argued, the time was right for such an effort.
Following additional preliminary discussions, some members of the Human Genome Sequencing Consortium formed a working group and came together for an official planning meeting on 17 and 18 September 2007 at the Sanger Institute in Cambridge. There, scientists discussed the technical aspects of such a sequencing effort, what DNA samples to use, and the ethical, legal, and social issues related to the project. A summary document from the 2007 meeting states that the primary goal of the project was to discover essentially all DNA variants present at frequencies as low as one percent across the genome and from 0.1 to 0.5 percent in coding regions, or genes. To accomplish their goals, the participants proposed a series of three pilot projects to evaluate different sequencing technologies, the efficiency of work processes, and data quality before undertaking the full project. For those pilot studies, the scientists proposed to test the sequencing technologies on DNA samples previously collected for the HapMap project.
The consortium behind what became known as the 1,000 Genomes Project consisted of hundreds of scientists from many institutions around the world, including universities and private companies. David Altshuler from the Broad Institute of the Massachusetts Institute of Technology and Harvard University, both in Cambridge, Massachusetts, was one co-chair of the Steering Committee of the Project. Richard M. Durbin, from the Sanger Institute, was the other co-chair of that committee. The National Human Genome Research Institute, part of the US National Institutes of Health, helped to fund and direct the effort, along with funding agencies in Britain, China, Germany, and Canada. The predicted cost of the Project was $120 million.
Phases of the 1,000 Genomes Project
The 1,000 Genomes Project took place in three phases from 2008 through 2015. For Phase I of the project, scientists sequenced the genomes of 1,092 individuals from fourteen populations around the world and published the results in 2012. Phase II of the Project was devoted to technical development rather than data production and analysis, and was not accompanied by a published scientific report. For the third and final phase, Project scientists added individuals from twelve additional populations to those from Phase I and sequenced an additional 1,412 genomes, for a total of 2,504 genomes. They published those results in 2015. As of 2024, all data from the Project are freely accessible to the public through the internet.
For both Phase I and Phase III, Project scientists had to decide which populations to include in their analysis. As the scientists stated in supplemental information accompanying each published paper, their choice of populations was based on a mix of scientific, ethical, and practical considerations. The most important scientific consideration, they stated, was to collect DNA samples from multiple yet distinct populations within a continent in an effort to be broadly representative. From an ethical perspective, the relevant considerations were having broad consent on the part of participants and the avoidance of small populations for which stigmatization or breach of privacy were risks. Lastly, from a practical point of view, the Project scientists wanted to include samples for which considerable data were already available, such as those from the HapMap, in part for quality control measures.
For Phase I, scientists relied on a combination of previously collected and newly collected blood samples from individuals in each of fourteen populations. Eight of the fourteen populations had been part of the HapMap Project, completed in 2010. Those populations were people with African ancestry in the Southwest United States; Han Chinese people in Beijing, China; Japanese people in Tokyo, Japan; Luhya people in Webuye, Kenya; people with Mexican ancestry in Los Angeles, California; Utah residents with ancestry from Northern and Western Europe; Toscani people in Italy; and Yoruba people in Ibadan, Nigeria. For those eight populations, Project scientists obtained stored HapMap samples from the Coriell Institute, a blood and tissue bank in Camden, New Jersey. The other six populations consisted of Finnish people in Finland; British people from England and Scotland, UK; Colombian people in Medellin, Colombia; Han Chinese people from South China; Iberian people from Spain; and Puerto Rican people in Puerto Rico. Of those latter six, the Finnish samples came from a researcher at the University of Helsinki, in Helsinki, Finland, and the rest were ones that Project scientists collected specifically for the 1,000 Genomes Project. Project researchers in each of the above locales oversaw the collection process, including obtaining informed consent from the participants.
For Phase III, Project scientists collected blood samples from individuals from each of twelve additional populations. Those populations were Esan people in Nigeria; Gambian Mandinka people in Western Division, The Gambia; Mende people in Sierra Leone; African Caribbean people in Barbados; Peruvian people in Lima, Peru; Chinese Dai people in Xishuangbanna, China; Kinh people in Ho Chi Minh City, Vietnam; Bengali people in Bangladesh; Gujarati Indian people in Houston, Texas; Indian Telugu people in the UK; Punjabi people in Lahore, Pakistan; and Sri Lankan Tamil people in the UK.
After each series of blood collections, Project scientists sent the blood samples to the Coriell Institute, which stores the samples in the form of cell lines. All 2,504 cell lines from the 1,000 Genomes Project are stored at Coriell. No personal information is included with the samples besides sex and population of origin. Information about health and illness is not included, although those individuals who agreed to participate declared themselves to be healthy at the time of donation.
Results of the Project
The 1,000 Genomes Project Consortium reported results from Phase I of the project in October 2012 in the journal Nature. From the 1,092 individuals they studied, Project scientists identified 38 million SNPs, as well as a little more than a million insertions and deletions. They estimated that they had found ninety-eight percent of all SNPs present at a frequency of at least one percent in the populations they studied.
The consortium reported the third and final phase of results of the project in September 2015 in two articles published in the journal Nature. From sequencing the genomes of an additional 1,412 individuals, for a total of 2,504 genomes, the researchers discovered a total of 84.7 million SNPs and nearly four million insertions and deletions, which they also mapped to specific haplotypes. The scientists estimated that the final version of the 1,000 Genomes Project resource includes more than ninety-nine percent of SNPs with a frequency of at least one percent in the populations they studied. The findings show that although most common genetic variants are shared across the world, rarer variants are typically restricted to one or a few populations. The researchers identified 762,000 variants that are rare within a global context, meaning present in less than 0.5 percent of individuals, but much more common, meaning present in greater than five percent of individuals, in at least one population.
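The last criterion, rare globally but common in at least one population, amounts to a simple filter over per-population frequencies. The sketch below assumes the thresholds given above (less than 0.5 percent globally, greater than five percent in some population). The function name, population labels, and frequencies are invented for illustration and are not Project data.

```python
def rare_globally_common_locally(global_freq, population_freqs,
                                 rare_max=0.005, common_min=0.05):
    """True if a variant is rare overall but common in some population."""
    is_rare_globally = global_freq < rare_max
    is_common_somewhere = any(
        freq > common_min for freq in population_freqs.values()
    )
    return is_rare_globally and is_common_somewhere


# Hypothetical variant: nearly absent worldwide, but present in
# eight percent of individuals in one population.
localized = rare_globally_common_locally(
    global_freq=0.004,
    population_freqs={"POP_A": 0.08, "POP_B": 0.0, "POP_C": 0.001},
)

# Hypothetical variant: rare everywhere.
rare_everywhere = rare_globally_common_locally(
    global_freq=0.002,
    population_freqs={"POP_A": 0.003, "POP_B": 0.002, "POP_C": 0.001},
)

print(localized, rare_everywhere)
```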
Legacy and Impact
Scholarly citation practices reflect the 1,000 Genomes Project’s influence on biomedical science. The Project’s 2012 paper reporting on Phase I has been cited over 8,600 times as of 2025. Its 2015 paper, “A Global Reference for Human Genetic Variation,” has been cited nearly 18,000 times as of 2025.
The main scientific use of the 1,000 Genomes data is in providing finer-grained detail to existing genome-wide association studies. Most GWAS that scientists conducted as of 2025 have used the HapMap data and reports. GWAS using the HapMap led to some disease-related discoveries, including for macular degeneration, multiple sclerosis, and heart disease. Compared to the HapMap, the 1,000 Genomes Project has much more complete coverage of genetic variation, both along each chromosome and in the population at large. Thus, scientists conducting GWAS can use the 1,000 Genomes Project to fill in missing details along each haplotype that they have associated with a disease condition through a GWAS. Scientists call that process imputing genotypes. Imputing genotypes using the 1,000 Genomes Project can enable researchers to uncover additional genetic variants that may be related to disease. Researchers have used the 1,000 Genomes Project data to identify previously missed genetic variations linked to a number of conditions, including celiac disease, prostate cancer, glioma, type 2 diabetes, coronary artery disease, epithelial ovarian cancer, breast cancer, glycaemia, and obesity.
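The idea behind imputing genotypes can be shown with a deliberately simplified example. Actual imputation software relies on statistical models of haplotype sharing, so the following Python sketch is only a toy illustration of the principle: it assumes a small reference panel of fully sequenced haplotypes and a study haplotype in which untyped sites are marked with "?", and it fills each gap with the allele most frequent among the panel haplotypes that match at the typed sites. The function name and all sequences are invented.

```python
from collections import Counter


def impute_haplotype(observed, reference_panel):
    """Fill '?' sites using panel haplotypes that match the typed sites."""
    # Keep only panel haplotypes consistent with every typed site.
    matches = [
        hap for hap in reference_panel
        if all(o == "?" or o == r for o, r in zip(observed, hap))
    ]
    if not matches:
        return observed  # nothing in the panel to copy from

    filled = []
    for position, allele in enumerate(observed):
        if allele != "?":
            filled.append(allele)
        else:
            # Use the most frequent allele among matching haplotypes.
            counts = Counter(hap[position] for hap in matches)
            filled.append(counts.most_common(1)[0][0])
    return "".join(filled)


# Invented reference panel of four haplotypes across five sites.
panel = ["ACGTA", "ACGTA", "TCGAA", "TGCAT"]

imputed_first = impute_haplotype("A?G?A", panel)
imputed_second = impute_haplotype("T???A", panel)
print(imputed_first, imputed_second)
```

A denser reference panel supplies more haplotypes to match against, which is why a catalog with more complete coverage can recover variants that an earlier panel would miss.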
Another use of the 1,000 Genomes data is for studies of human evolution. Scientists have used the 1,000 Genomes Project to investigate questions such as which parts of the genome have been under natural selection. Scientists can infer the occurrence of natural selection on a genome region by looking at the extent of genetic variation around the region. Areas experiencing natural selection will show less genetic variation in a population because, when some genotypes are favored over others, they increase in frequency and reduce overall diversity in that region. Scientists have also explored questions related to the hypothesis that modern humans evolved in Africa and migrated from there around the world, known as the out-of-Africa hypothesis. According to Paul Flicek, who studies bioinformatics with the European Molecular Biology Laboratory in Cambridge, England, and who worked on the 1,000 Genomes Project, the results of the Project support that hypothesis by showing that African populations carry a larger amount of genetic diversity than non-African populations. While a majority of common genetic variants are shared globally, African populations have the most continental and population-specific variants, and European and American populations have the least, a reflection of the migratory history of humans out of Africa and around the world.
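One simple way to quantify the extent of genetic variation in a region is the average number of differences between pairs of sequences. The Python sketch below assumes a handful of aligned haplotypes of equal length. The sequences are invented, and the measure is only one of several statistics that researchers use when looking for signs of selection.

```python
from itertools import combinations


def mean_pairwise_differences(haplotypes):
    """Average number of differing sites over all pairs of haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(first, second))
        for first, second in pairs
    )
    return total_differences / len(pairs)


# Invented example: a region with several variable sites...
varied_region = ["ACGT", "ACGA", "TCGT", "ACCT"]
# ...and a region where one haplotype has become nearly universal.
uniform_region = ["ACGT", "ACGT", "ACGT", "ACGA"]

diversity_varied = mean_pairwise_differences(varied_region)
diversity_uniform = mean_pairwise_differences(uniform_region)
print(diversity_varied, diversity_uniform)
```

In this toy comparison the second region has lower diversity, which is the kind of pattern that can prompt closer study of a region as a candidate for past selection.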
Since the 1,000 Genomes Project was completed, other larger genome sequencing efforts have taken place. For example, the British government sponsored a project to sequence 100,000 genomes of British residents, which it completed in 2018. That project, like some others, attempts to link DNA sequence data with specific disease conditions of the donors. Because patient information is included in the database, there are proprietary and privacy concerns that prevent the database from being publicly accessible. In addition, because the collection is made up of information from only British residents, the findings may not be as relevant or applicable to other groups.
By sequencing the whole genomes of 2,504 individuals from twenty-six populations of people living around the world, the 1,000 Genomes Project provided scientists with a much more detailed picture of human genetic variation than had existed previously. As of 2025, the geographically diverse, publicly accessible 1,000 Genomes Project data are still in use and contributing to scientific discovery.
Sources
- 1000 Genomes Project Consortium. “A Global Reference for Human Genetic Variation.” Nature 526 (2015): 68–74. https://www.nature.com/articles/nature15393 (Accessed July 9, 2024).
- 1000 Genomes Project Consortium. “An Integrated Map of Genetic Variation from 1,092 Human Genomes.” Nature 491 (2012): 56–65. https://www.nature.com/articles/nature11632 (Accessed July 9, 2024).
- 1000 Genomes. Meeting Report: A Workshop to Plan a Deep Catalog of Human Genetic Variation. 1000 Genomes. https://www.internationalgenome.org/sites/1000genomes.org/files/docs/1000Genomes-MeetingReport.pdf (Accessed July 9, 2024).
- Belsare, Saurabh, Michal Levy-Sakin, Yulia Mostovoy, et al. “Evaluating the Quality of the 1000 Genomes Project Data.” BMC Genomics 20 (2019): 620.
- Coriell Institute for Medical Research. “1000 Genomes Project.” Coriell Institute for Medical Research. https://www.coriell.org/1/NHGRI/Collections/1000-Genomes-Project-Collection/1000-Genomes-Project (Accessed July 9, 2024).
- De Vries, Paul S., Maria Sabater-Lleal, Daniel I. Chasman, Stella Trompet, Tarunveer S. Ahluwalia, Alexander Teumer, Marcus E. Kleber, et al. “Comparison of HapMap and 1000 Genomes Reference Panels in a Large-Scale Genome-Wide Association Study.” PLoS One 12 (2017): e0167742.
- Manolio, Teri A., and Francis S. Collins. “The HapMap and Genome-Wide Association Studies in Diagnosis and Therapy.” Annual Review of Medicine 60 (2009): 443–56.
- National Human Genome Research Institute. “1000 Genomes Project: Defining Genetic Variation in People.” YouTube, uploaded by National Human Genome Research Institute. https://www.youtube.com/watch?v=ob581Nsvynw
- Nikpay, Majid, Anuj Goel, Hong-Hee Won, et al. “A Comprehensive 1000 Genomes–Based Genome-Wide Association Meta-Analysis of Coronary Artery Disease.” Nature Genetics 47 (2015): 1121–30.
- Rajagopalan, Ramya M., and Joan H. Fujimura. “Variations on a Chip: Technologies of Difference in Human Genetics Research.” Journal of the History of Biology 51 (2018): 841–73.
- The International Genome Sample Resource. “The 1000 Genomes Project.” The International Genome Sample Resource. https://www.internationalgenome.org/1000-genomes-summary (Accessed July 9, 2024).
- Zheng-Bradley, Xiangqun, and Paul Flicek. “Applications of the 1000 Genomes Project Resources.” Briefings in Functional Genomics 16 (2017): 163–70.
Rights
Copyright Arizona Board of Regents Licensed as Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported (CC BY-NC-SA 3.0)