The Human Pangenome Reference is a collection of genomes from a diverse cohort of individuals compiled by the Human Pangenome Reference Consortium (HPRC). This first draft pangenome comprises 47 phased, diploid assemblies and is intended to capture the genetic diversity of the human population. Its development seeks to address perceived shortcomings in the current human reference genome by offering a more comprehensive and inclusive resource for genomic research and analysis.[1]

Shared sequences and structural variants between genomes in Human Pangenome Reference

The HPRC, funded by the National Human Genome Research Institute, aims to create a complete and diverse human reference genome, known as the human pangenome reference, to better represent the global genomic landscape of diverse human populations. The HPRC has released a new high-quality collection of reference human genome sequences, which includes genome sequences from 47 individuals of diverse ancestries, with the goal of increasing that number to 350 by mid-2024[2].

The pangenome concept, originating from the study of prokaryotes, has been extended to multicellular eukaryotic organisms, including humans. The human pangenome has significant implications for population genetics, phylogenetics, and public health policy, as it can inform the genetic basis of diseases and personalized treatments by providing insights into the genetic diversity of human populations[3].

The new human pangenome reference integrates the missing 8% of the human genome sequence, adding over 100 million new bases. It aims to capture more population diversity than the previous reference sequence and is based on 94 high-quality haploid assemblies from individuals with broad genetic diversity. The generation of this reference genome focuses on eliminating gaps, incorporating complex genomic sequence features, and encompassing a broader spectrum of human genome diversity[4].

Background

The concept of a pangenome refers to a complete set of genes, regulatory elements, and non-genetic segments present in different numbers across individuals or lineages of a species. In the context of the human pangenome, it encompasses a more diverse representation of genetic variation across human populations, aiming to address the limitations of the single reference genome by capturing a broader range of genetic variation, including structural variants and alternative alleles[1].
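As a rough illustration of this idea, the sketch below treats each genome as a set of segment labels and separates the segments shared by every genome from those found in only some. The genome names and segment labels are invented for the example and do not come from HPRC data.

```python
def partition_segments(genomes):
    """Split segments into those shared by all genomes and those present in only some."""
    all_sets = list(genomes.values())
    union = set().union(*all_sets)
    shared = set.intersection(*all_sets)
    variable = union - shared
    return shared, variable

# Hypothetical toy data: segment labels carried by three made-up genomes.
toy_genomes = {
    "genome_A": {"seg1", "seg2", "seg3"},
    "genome_B": {"seg1", "seg2", "seg4"},
    "genome_C": {"seg1", "seg3", "seg4", "seg5"},
}

shared, variable = partition_segments(toy_genomes)
print(sorted(shared))    # segments every genome carries
print(sorted(variable))  # segments missing from at least one genome
```

A single linear reference corresponds to keeping only one of these sets; a pangenome keeps the union together with the information about which genome carries which segment.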

History

The human reference genome, initially drafted over 20 years ago, is a composite of merged haplotypes from more than 20 individuals, with a single individual contributing approximately 70% of the sequence. It has limitations, including biases and errors, and does not fully represent global human genomic variation. Most genomic research, including work based on the earlier GRCh38 and T2T-CHM13 references, has focused on individuals of European descent, which biases the datasets available for analysis. Consequently, precision medicine relies primarily on genomic variation found within populations of European ancestry. This limited scope overlooks a significant portion of the global genetic diversity that is crucial for understanding clinical phenotypes[2]. To overcome this, the Human Pangenome Reference Consortium (HPRC) has been working on a more complete human reference genome: a graph-based, telomere-to-telomere representation of global genomic diversity that integrates genome sequences from a diverse array of individuals. Its primary objectives include enhancing gene-disease association studies across populations and serving as an extensive genetic resource for future biomedical research and precision medicine[1][2].

Comparison of different human genome references
| Reference | Latest update | First release | Sequencing technologies | Pros | Cons |
| --- | --- | --- | --- | --- | --- |
| Human Reference Genome | February 3, 2022 (GRCh38.p14) | February 12, 2001 | Sanger sequencing | Well-developed tools; many databases built on this reference; wider annotation; simple interpretation of results | Limited diversity; not complete (gaps in centromeric, telomeric and other repeat regions) |
| Telomere-to-Telomere Human Genome | January 24, 2022 (T2T-CHM13v2.0) | January 22, 2020 | PacBio HiFi sequencing, ONT ultralong-read sequencing, Illumina PCR-Free sequencing, Hi-C Illumina short-read sequencing, BioNano optical maps, single-cell DNA template strand sequencing (Strand-seq)[5] | Gapless; high accuracy; improved structural variant discovery[5] | Limited diversity; higher memory requirements; less annotation; underdeveloped tools for analysis |
| HPRC Pangenome | May 10, 2023 | May 10, 2023 | PacBio HiFi sequencing, ONT long-read sequencing, Hi-C Illumina short-read sequencing[1] | Represents population diversity; fully phased; complete and gapless; high accuracy; improved structural variant discovery[6] | Large storage and memory requirements; less annotation; underdeveloped tools for analysis[6] |

The motivation for a pangenome reference lies in the limitations of the current linear, composite human reference genome, which introduces observational bias and hinders the study of sequence it does not contain. The transition to a pangenomic reference, as envisioned by the HPRC, aims to address these limitations by capturing a more comprehensive and diverse portrayal of global genomic variation[1][2].

To summarize, the pangenome concept encompasses a more inclusive representation of genetic variation across human populations, aiming to overcome the limitations of the current single reference genome. The HPRC's efforts in creating a human pangenome reference seek to provide a more inclusive and complete representation of global genomic diversity.

Properties of Human Pangenome Reference

The Human Pangenome Reference Consortium (HPRC) has developed a draft human pangenome reference, which includes 47 phased, diploid assemblies from a genetically diverse cohort of individuals. The HPRC samples were sequenced using Pacific Biosciences (PacBio) high-fidelity (HiFi) and Oxford Nanopore Technologies (ONT) long-read sequencing, Bionano optical maps and high-coverage Hi-C Illumina short-read sequencing[1].

Capturing variants

These assemblies are reported to cover more than 99% of the expected sequence in each genome and exhibit an accuracy of over 99% at both the structural and base pair levels. The pangenome captures known variants and haplotypes, reveals new alleles at structurally complex loci, and adds 119 million base pairs of euchromatic polymorphic sequence and 1,115 gene duplications relative to the existing reference GRCh38, with roughly 90 million of the additional base pairs derived from structural variation. Using this draft pangenome for analyzing short-read data has shown a 34% reduction in small variant discovery errors and a 104% increase in the detection of structural variants per haplotype compared to GRCh38-based workflows[1].
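To make these figures easier to read, the short calculation below works out what share of the added sequence comes from structural variation and what the two quoted percentage changes mean relative to a baseline. The baseline of 100 units is a hypothetical value chosen only for readability; it is not a count reported by the HPRC.

```python
# Figures quoted in the text above (base pairs, relative to GRCh38).
added_bp = 119_000_000          # euchromatic polymorphic sequence added
added_from_sv_bp = 90_000_000   # approximate portion derived from structural variation

sv_share = added_from_sv_bp / added_bp
print(f"Share of added sequence from structural variation: {sv_share:.1%}")

def apply_percent_change(baseline, percent_change):
    """Return the value after a relative change given in percent."""
    return baseline * (1 + percent_change / 100)

# Hypothetical baselines of 100 units, used only to make the percentages concrete.
errors_after = apply_percent_change(100, -34)   # 34% fewer small-variant errors
svs_after = apply_percent_change(100, 104)      # 104% more SVs detected per haplotype
print(errors_after, svs_after)
```

In other words, roughly three quarters of the added sequence is attributed to structural variation, and a 104% increase means slightly more than twice as many structural variants detected per haplotype.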

Representation of diversity

The HPRC's efforts are part of a broader initiative to sequence and assemble genomes from individuals across diverse populations, with the goal of better representing the genomic landscape of human diversity. The consortium aims to increase the number of genome sequences to 350 by mid-2024, providing a more complete and inclusive resource for genomic research and analysis.[1] The development of the human pangenome reference marks a notable advancement in genomics, as it offers a more accurate and diverse depiction of global genomic variation. This development is expected to enhance gene-disease association studies across populations, broaden the scope of genomics research to encompass the most repetitive and polymorphic regions of the genome, and serve as a valuable genetic resource for future studies[1].

The HPRC is funded by the National Human Genome Research Institute to sequence and assemble genomes from individuals from diverse populations in order to better represent the genomic landscape of diverse human populations[4][7]. HPRC sample subpopulations include ACB, African Caribbean in Barbados; ASW, African Ancestry in Southwest US; CHS, Han Chinese South; CLM, Colombian in Medellin, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division; KHV, Kinh in Ho Chi Minh City, Vietnam; MKK, Maasai in Kinyawa, Kenya; MSL, Mende in Sierra Leone; PEL, Peruvian in Lima, Peru; PJL, Punjabi in Lahore, Pakistan; PUR, Puerto Rican in Puerto Rico; YRI, Yoruba in Ibadan, Nigeria[1].

The human pangenome reference is more comprehensive than previous reference sequences. It incorporates over 100 million new bases from 47 people with diverse ancestries, capturing more population diversity than previous references[8][4].

Ensuring explicit consent to data sharing and protecting individual privacy are essential considerations. Encrypting genetic data while providing access to the pangenome may be a potential solution to balance privacy and data sharing[3].

Impact in human diseases

The pangenome reference is expected to have a profound impact on studies of the genetic basis of human diseases. It is expected to enable the discovery of disease-risk alleles and previously unobserved rare variants, especially in regions that are inaccessible to standard short-read sequencing technologies. Additionally, it is projected to enhance genetic diagnosis, functional annotation of variants, and genotyping accuracy, making these less dependent on ancestry.[1] Pangenome data could reveal important information about diseases, health conditions, and health outcomes. However, it is important to consider how this information should be used and its implications for public health policy[3].

Data generation and storage

The large amount of data required to map the human pangenome presents challenges in terms of data generation and storage. The volume of data required may pose obstacles to implementation. The technical implications of handling such vast amounts of data should not be overlooked. It may be challenging to genotype a sufficient number of individuals to achieve a reliable and accurate representation of the human pangenome due to the large scale required, potentially exceeding the capabilities of current technology[3].

Ethical considerations

Data generated from individuals involved in pangenome projects contain highly sensitive and personal information. This raises ethical issues related to privacy, data protection, and potential misuse, including the risk of identifying subjects or discriminating against certain populations or individuals[3].

Human Pangenome generation

Sample selection and sequencing

The pangenome reference includes 47 fully phased diploid genomes. Among these, 29 genomes were entirely generated by HPRC, while the remaining 18 were produced by other efforts[1]. Additional data for some of these samples were generated by HPRC. The 29 HPRC samples were chosen from the 1000 Genomes Project (1KG) lymphoblastoid cell lines, restricted to lines classified as karyotypically normal and at low passage. Another prerequisite was the availability of whole-genome sequencing data for both parents.

The samples were sequenced using Pacific Biosciences (PacBio) high-fidelity (HiFi) sequencing at 39.7× depth of coverage, Oxford Nanopore Technologies (ONT) long-read sequencing, Bionano optical maps, and high-coverage Hi-C Illumina short-read sequencing. For the 18 additional samples, the nanopore unsheared long-read sequencing protocol was used, yielding approximately 60× coverage of unsheared sequencing data[1].

A brief overview of different steps in genome de novo assembly

Assembling genomes

The Trio-Hifiasm[9][10] tool was selected as the primary assembler following thorough benchmarking of multiple alternatives. Trio-Hifiasm leverages PacBio HiFi long-read sequences and parental Illumina short-read sequences to generate highly phased contig assemblies[1]. The following steps were carried out to remove adaptor and nonhuman sequence contamination and to ensure a single mitochondrial assembly per maternal assembly:

  1. Assembly assessment
  2. Regional assembly reliability
  3. Completeness and CNV
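The use of parental short reads for phasing, mentioned above, can be pictured with a toy version of trio binning: k-mers seen in only one parent are used to label the child's long reads. The sketch below is a simplified illustration of that general idea under invented sequences; it is not Trio-Hifiasm's actual algorithm, and the function names are hypothetical.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def bin_read(read, maternal_only, paternal_only, k):
    """Assign a read to a parental haplotype by counting parent-specific k-mers."""
    read_kmers = kmers(read, k)
    maternal_hits = len(read_kmers & maternal_only)
    paternal_hits = len(read_kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"

K = 5
# Hypothetical toy sequences; the parents differ at a single base.
mother_seq = "ACGTACGTTAGCCGATAGGCT"
father_seq = "ACGTACGTTAGACGATAGGCT"

mother_kmers = kmers(mother_seq, K)
father_kmers = kmers(father_seq, K)
maternal_only = mother_kmers - father_kmers   # k-mers seen only in the mother
paternal_only = father_kmers - mother_kmers   # k-mers seen only in the father

# Two reads overlap the differing base; the third covers only shared sequence.
child_reads = ["CGTTAGCCGATAG", "CGTTAGACGATAG", "ACGTACGT"]
labels = [bin_read(r, maternal_only, paternal_only, K) for r in child_reads]
print(labels)
```

Reads that overlap a parent-specific site can be assigned to a haplotype, while reads from regions where the parents are identical stay unassigned, which is one reason long reads are helpful for phasing.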

Constructing the pangenome graph

Three different tools were used to construct the pangenome graph: Minigraph[11], Minigraph-Cactus[13] and the PanGenome Graph Builder (PGGB)[15].
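As a minimal sketch of what a pangenome graph represents, the example below stores sequence in nodes and records each haplotype as a walk through those nodes. The class, node labels, and sequences are hypothetical and greatly simplified; this is not the data model or file format of any particular graph construction tool.

```python
from dataclasses import dataclass, field

@dataclass
class SequenceGraph:
    """A minimal sequence graph: labeled nodes plus named walks (haplotype paths)."""
    nodes: dict = field(default_factory=dict)   # node id -> DNA string
    paths: dict = field(default_factory=dict)   # haplotype name -> list of node ids

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_path(self, name, node_ids):
        self.paths[name] = list(node_ids)

    def spell(self, name):
        """Reconstruct a haplotype's sequence by concatenating the nodes on its path."""
        return "".join(self.nodes[n] for n in self.paths[name])

    def edges(self):
        """Edges implied by consecutive nodes on any path."""
        return {(a, b) for walk in self.paths.values() for a, b in zip(walk, walk[1:])}

    def shared_nodes(self):
        """Nodes traversed by every haplotype path."""
        walks = [set(w) for w in self.paths.values()]
        return set.intersection(*walks) if walks else set()

graph = SequenceGraph()
graph.add_node(1, "ACGT")
graph.add_node(2, "G")        # one allele of a small variant
graph.add_node(3, "T")        # alternative allele
graph.add_node(4, "TTCA")
graph.add_node(5, "GGGCCC")   # inserted sequence carried by one haplotype only
graph.add_node(6, "AT")

graph.add_path("hap1", [1, 2, 4, 6])
graph.add_path("hap2", [1, 3, 4, 5, 6])

print(graph.spell("hap1"))
print(graph.spell("hap2"))
print(sorted(graph.shared_nodes()))
```

Sequence common to both haplotypes is stored once, while the small variant and the insertion appear as alternative routes through the graph.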

Applications

The reported accuracy and completeness of the human pangenome reference, coupled with the diversity achieved by sequencing multiple individuals from various locations, present opportunities for identifying novel variation. Because the human pangenome is intended to mitigate the mapping biases associated with a single linear reference genome such as GRCh38 or T2T-CHM13, it has a range of potential applications in genomics.

Small variants

An application of note is pangenome-based short variant discovery, in which short reads are aligned to a pangenome graph to improve the accuracy of calling small variants such as SNPs and indels. This method is reported to perform better than traditional approaches, particularly in complex regions and in medically relevant genes. The pangenome is also reported to aid variant calling in parent-child trios, potentially enhancing accuracy in this context[1].
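The reference bias that graph alignment is meant to reduce can be shown with a toy example: a read carrying a non-reference allele mismatches the linear reference but matches one of the haplotypes exactly. The sequences are invented, and comparing the read against each haplotype separately is a simplification; real tools align reads to the graph itself.

```python
def mismatches(read, target, offset):
    """Count mismatching bases when the read is placed at a fixed offset on the target."""
    window = target[offset:offset + len(read)]
    return sum(1 for a, b in zip(read, window) if a != b)

def best_alignment(read, target):
    """Return (mismatch count, offset) of the best ungapped placement of the read."""
    placements = range(len(target) - len(read) + 1)
    return min((mismatches(read, target, i), i) for i in placements)

# Hypothetical sequences: the two haplotypes differ by a single base.
linear_reference = "ACGTGTTCAAT"
haplotypes = {"hap1": "ACGTGTTCAAT", "hap2": "ACGTTTTCAAT"}

read = "GTTTTCA"  # sampled from hap2, so it carries the non-reference allele

linear_score, _ = best_alignment(read, linear_reference)
pangenome_score = min(best_alignment(read, seq)[0] for seq in haplotypes.values())
print(linear_score, pangenome_score)
```

With only the linear reference, the true variant looks the same as a sequencing error; when the alternative allele is part of the reference, the read aligns without mismatches.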

Structural Variants

Another key application lies in SV genotyping, where the sequence-resolved structural variants (SVs) within the pangenome enable the identification and genotyping of diverse SV alleles[1].

Variable Number Tandem Repeat

Improvements in the mapping of variable number tandem repeat (VNTR) regions, in RNA sequencing read mapping, and in chromatin immunoprecipitation and sequencing analysis were also reported.

In summary, the pangenome is regarded as a resource with potential for enhancing variant discovery, population genetics analyses, and the detection of complex genetic events that may not be identified by conventional reference genomes[1].

Limitations

Lack of established tools

Most current tools were developed for GRCh38, the human reference genome. Variant discovery based on the human reference genome is known to miss some variation, because that reference lacks diversity and is neither complete nor fully accurate. Using graph-based references for alignment can increase the accuracy of the analysis, as they are more diverse and complete[6].

Alignment is one example of a task for which new bioinformatics tools supporting the pangenome reference need to be developed. Because most bioinformatics pipelines and tools are built around the human reference genome, there is a need for tools that use the human pangenome reference instead. There have been efforts to develop tools for this purpose in different applications including:

Currently available application and tools for Human Pangenome Reference[6]

Scale-up problems

Estimates suggest that by 2025 between 100 million and 2 billion human genomes will have been sequenced; given current price trends, storing these data would be expensive and problematic[6]. With the increasing availability of personal genome data, the initial dataset, currently in the thousands of gigabase-scale genomes, is poised to expand exponentially. This growth will necessitate more efficient analysis algorithms and data representation formats that can handle the escalating demands on time, memory, and storage space[6].
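A back-of-the-envelope calculation shows how quickly storage grows at this scale. The genome counts are the range quoted above; the figure of 100 GB per genome is purely an assumption for illustration, since actual sizes depend on sequencing depth, file format, and compression.

```python
def storage_petabytes(genome_count, gigabytes_per_genome):
    """Total storage in petabytes (decimal units: 1 PB = 1,000,000 GB)."""
    return genome_count * gigabytes_per_genome / 1_000_000

# Range of sequenced genomes quoted in the text above.
LOW_ESTIMATE = 100_000_000
HIGH_ESTIMATE = 2_000_000_000

# Assumption for illustration only: 100 GB of stored data per genome.
ASSUMED_GB_PER_GENOME = 100

for count in (LOW_ESTIMATE, HIGH_ESTIMATE):
    total_pb = storage_petabytes(count, ASSUMED_GB_PER_GENOME)
    print(f"{count:,} genomes -> {total_pb:,.0f} PB")
```

Changing the assumed per-genome size scales the totals linearly, so the result should be read as an order-of-magnitude guide rather than a forecast.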

Privacy problems for expanding the dataset

Expanding the human pangenome reference to the proposed 700 haplotypes (350 individuals) poses challenges in ensuring inclusivity, owing to linguistic, literacy, and socioeconomic barriers and to distrust among racial-ethnic minorities and Indigenous peoples. Obtaining informed consent becomes complex, as participants need to understand the implications of the project. Balancing the release of post-analysis genomic data against ethical considerations presents dilemmas concerning complete information disclosure.[6]

Legal considerations

There are concerns about genetic discrimination, denial of rights or opportunities, and ethical use of pangenome data, including issues related to consent, ownership, and access. Establishing a legal and regulatory framework is crucial to address issues such as ownership, access, use, protection, and intellectual property rights associated with pangenome data[3].

References

  1. Liao, Wen-Wei; Asri, Mobin; Ebler, Jana; Doerr, Daniel; Haukness, Marina; Hickey, Glenn; Lu, Shuangjia; Lucas, Julian K.; Monlong, Jean; Abel, Haley J.; Buonaiuto, Silvia; Chang, Xian H.; Cheng, Haoyu; Chu, Justin; Colonna, Vincenza (May 2023). "A draft human pangenome reference". Nature. 617 (7960): 312–324. Bibcode:2023Natur.617..312L. doi:10.1038/s41586-023-05896-x. ISSN 1476-4687. PMC 10172123. PMID 37165242.
  2. Wang, Ting; Antonacci-Fulton, Lucinda; Howe, Kerstin; Lawson, Heather A.; Lucas, Julian K.; Phillippy, Adam M.; Popejoy, Alice B.; Asri, Mobin; Carson, Caryn; Chaisson, Mark J. P.; Chang, Xian; Cook-Deegan, Robert; Felsenfeld, Adam L.; Fulton, Robert S.; Garrison, Erik P. (April 2022). "The Human Pangenome Project: a global resource to map genomic diversity". Nature. 604 (7906): 437–446. Bibcode:2022Natur.604..437W. doi:10.1038/s41586-022-04601-8. ISSN 1476-4687. PMC 9402379. PMID 35444317.
  3. Abondio, Paolo; Cilli, Elisabetta; Luiselli, Donata (June 2023). "Human Pangenomics: Promises and Challenges of a Distributed Genomic Reference". Life. 13 (6): 1360. Bibcode:2023Life...13.1360A. doi:10.3390/life13061360. ISSN 2075-1729. PMC 10304804. PMID 37374141.
  4. Lee, HoJoon; Greer, Stephanie U.; Pavlichin, Dmitri S.; Zhou, Bo; Urban, Alexander E.; Weissman, Tsachy; Liao, Wen-Wei; Asri, Mobin; Ebler, Jana; Doerr, Daniel; Haukness, Marina; Hickey, Glenn; Lu, Shuangjia; Lucas, Julian K.; Monlong, Jean (2023-08-28). "Pan-conserved segment tags identify ultra-conserved sequences across assemblies in the human pangenome". Cell Reports Methods. 3 (8): 100543. doi:10.1016/j.crmeth.2023.100543. ISSN 2667-2375. PMC 10475782. PMID 37671027.
  5. Nurk, Sergey; Koren, Sergey; Rhie, Arang; Rautiainen, Mikko; Bzikadze, Andrey V.; Mikheenko, Alla; Vollger, Mitchell R.; Altemose, Nicolas; Uralsky, Lev; Gershman, Ariel; Aganezov, Sergey; Hoyt, Savannah J.; Diekhans, Mark; Logsdon, Glennis A.; Alonge, Michael (April 2022). "The complete sequence of a human genome". Science. 376 (6588): 44–53. Bibcode:2022Sci...376...44N. doi:10.1126/science.abj6987. ISSN 0036-8075. PMC 9186530. PMID 35357919.
  6. Singh, Vipin; Pandey, Shweta; Bhardwaj, Anshu (2022). "From the reference human genome to human pangenome: Premise, promise and challenge". Frontiers in Genetics. 13. doi:10.3389/fgene.2022.1042550. ISSN 1664-8021. PMC 9684177. PMID 36437921.
  7. "Welcome to HPRC". humanpangenome.org. Retrieved 2024-02-23.
  8. "A new human "pangenome" reference". www.genome.gov. Retrieved 2024-02-23.
  9. Jarvis, Erich D.; Formenti, Giulio; Rhie, Arang; Guarracino, Andrea; Yang, Chentao; Wood, Jonathan; Tracey, Alan; Thibaud-Nissen, Francoise; Vollger, Mitchell R.; Porubsky, David; Cheng, Haoyu; Asri, Mobin; Logsdon, Glennis A.; Carnevali, Paolo; Chaisson, Mark J. P. (November 2022). "Semi-automated assembly of high-quality diploid human reference genomes". Nature. 611 (7936): 519–531. Bibcode:2022Natur.611..519J. doi:10.1038/s41586-022-05325-5. ISSN 1476-4687. PMC 9668749. PMID 36261518.
  10. Li, Heng; Bloom, Jonathan M.; Farjoun, Yossi; Fleharty, Mark; Gauthier, Laura; Neale, Benjamin; MacArthur, Daniel (August 2018). "A synthetic-diploid benchmark for accurate variant-calling evaluation". Nature Methods. 15 (8): 595–597. doi:10.1038/s41592-018-0054-7. ISSN 1548-7105. PMC 6341484. PMID 30013044.
  11. Li, Heng; Feng, Xiaowen; Chu, Chong (2020-10-16). "The design and construction of reference pangenome graphs with minigraph". Genome Biology. 21 (1): 265. doi:10.1186/s13059-020-02168-z. ISSN 1474-760X. PMC 7568353. PMID 33066802.
  12. Li, Heng (2018-05-10). "Minimap2: pairwise alignment for nucleotide sequences". Bioinformatics. 34 (18): 3094–3100. doi:10.1093/bioinformatics/bty191. ISSN 1367-4803. PMC 6137996. PMID 29750242.
  13. Hickey, Glenn; Monlong, Jean; Ebler, Jana; Novak, Adam M.; Eizenga, Jordan M.; Gao, Yan; Marschall, Tobias; Li, Heng; Paten, Benedict (2023-05-10). "Pangenome graph construction from genome alignments with Minigraph-Cactus". Nature Biotechnology: 1–11. doi:10.1038/s41587-023-01793-w. ISSN 1546-1696. PMC 10638906. PMID 37165083.
  14. Armstrong, Joel; Hickey, Glenn; Diekhans, Mark; Fiddes, Ian T.; Novak, Adam M.; Deran, Alden; Fang, Qi; Xie, Duo; Feng, Shaohong; Stiller, Josefin; Genereux, Diane; Johnson, Jeremy; Marinescu, Voichita Dana; Alföldi, Jessica; Harris, Robert S. (November 2020). "Progressive Cactus is a multiple-genome aligner for the thousand-genome era". Nature. 587 (7833): 246–251. Bibcode:2020Natur.587..246A. doi:10.1038/s41586-020-2871-y. ISSN 1476-4687. PMC 7673649. PMID 33177663.
  15. Garrison, Erik; Guarracino, Andrea; Heumos, Simon; Villani, Flavia; Bao, Zhigui; Tattini, Lorenzo; Hagmann, Jörg; Vorbrugg, Sebastian; Marco-Sola, Santiago (2023-04-06), "Building pangenome graphs", bioRxiv: The Preprint Server for Biology, doi:10.1101/2023.04.05.535718, PMC 10104075, PMID 37066137, retrieved 2024-02-24
