The release of draft sequences of the human genome in 2001 was a landmark achievement1,2. For the first time, scientists could study long stretches of each human chromosome, base by base. Researchers could therefore begin to understand how individual genes were ordered, and how the surrounding non-protein-coding DNA was structured and organized. Despite this remarkable progress, the draft was still incomplete, with more than 150 million bases missing3. Technological advances in the intervening years have allowed researchers to add to the draft, with the complete sequence of a chromosome finally reached4,5 in 2020. As a result, new and uncharacterized parts of the human genome are beginning to emerge, ushering in another exciting period of biological discovery.
What exactly is contained in the draft genome? The original draft contained many previously unexplored intergenic regions. It also included the vast majority of genes. The International Human Genome Sequencing Consortium1 initially estimated that the genome contained 30,000-40,000 protein-coding genes, although the publication of an updated genome6 in 2004, together with improved gene-prediction approaches7, led to the figure being revised to about 20,000. The 2004 genome produced a high-resolution map of 2.85 billion nucleotides from euchromatin, the more loosely packed regions of DNA, which are enriched in genes and make up about 92% of the human genome.
The reference genome launched the scientific community into an era of genome exploration, shifting the focus from single genes to more complete, genome-wide studies. However, there were gaps on each of the 23 pairs of human chromosomes, which together are estimated to contain more than 150 megabases of unknown sequence3 (Fig. 1). The largest gaps were enriched in regions of highly repetitive DNA or in sequences for which there are nearly identical copies. These regions were originally difficult to clone, sequence and assemble correctly. As a result, the Human Genome Project deliberately under-represented these repetitive sequences. Although researchers had a very basic idea of the nature of the sequences in these regions, their high-resolution genomic organization remained elusive.
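A toy sketch can illustrate why such repeats are hard to assemble. It assumes error-free reads and ignores read depth, and the sequences and the `reads` helper are hypothetical, chosen only for illustration rather than taken from any sequencing pipeline. Two genomes that differ only in how many times a unit is repeated in tandem yield the same set of distinct short reads; only reads long enough to span the whole array and its flanks tell them apart.

```python
def reads(genome, length):
    """Return every distinct error-free read of the given length (toy model)."""
    return {genome[i:i + length] for i in range(len(genome) - length + 1)}

# Hypothetical sequences, chosen only for illustration.
LEFT, UNIT, RIGHT = "CCGTAGGC", "GATTACA", "TGCCGTAA"
genome_a = LEFT + UNIT * 3 + RIGHT  # three tandem copies of the unit
genome_b = LEFT + UNIT * 5 + RIGHT  # five tandem copies of the unit

# 10-base reads are shorter than the repeat array in either genome.
short_reads_match = reads(genome_a, 10) == reads(genome_b, 10)

# 25-base reads can span the 21-base array in genome_a plus both flanks.
long_reads_match = reads(genome_a, 25) == reads(genome_b, 25)

print(short_reads_match, long_reads_match)
```

In this simplified model the short reads give no way to choose between three and five copies, whereas the longer reads include a fragment that anchors the array to both flanks.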
Early attempts to close the gaps used long sequence reads to span the repetitive sequences, but such reads were initially extremely error-prone. In the 2010s, new opportunities arose, thanks to advances in the ability to read longer pieces of sequence (for example, as set out in refs. 8 and 9), along with the development of scalable bioinformatics tools. Reads of tens to hundreds of kilobases allowed the study of the genomic organization of moderately sized gaps. This provided insight into some subtelomeric regions9, the repetitive DNA adjacent to the telomere structures that cap the ends of chromosomes. It also made possible the study of the first centromeric satellite array10, in which short sequences are repeated in tandem for about 300 kilobases. A subset of segmental duplications (stretches of sequence that share 90-100% of their bases and are found at multiple locations) was also resolved, many of which contain genes that were previously missing from the reference genome9,11. However, many of the largest repetitive regions, which span multiple megabases, remained intractable.
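The 90-100% figure in the definition of segmental duplications can be made concrete with a small sketch. It assumes two copies that are already aligned and of equal length, and it ignores insertions, deletions and any minimum-length criterion. The function names and example sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of matching positions in two equal-length, pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def looks_like_duplication(a, b, threshold=90.0):
    """True if the two copies share at least `threshold` percent of their bases."""
    return percent_identity(a, b) >= threshold

# Hypothetical copies found at two different sites.
copy_1 = "ACGTACGTACGTACGTACGT"
copy_2 = "ACGTACGTACTTACGTACGT"  # one mismatch in 20 bases

print(percent_identity(copy_1, copy_2), looks_like_duplication(copy_1, copy_2))
```

Real analyses work from alignments that allow gaps and consider much longer stretches of sequence; the sketch only shows how the identity threshold itself is applied.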
Over the past few years, the combination of ultra-long reads9 and highly accurate long-read data12 has been a game changer for resolving these regions13,14, revealing for the first time extraordinarily long stretches of tandem repeats and regions enriched in segmental duplications. By breaking through these technological barriers, scientists are now uncovering extensive repeat-rich regions that can span millions of bases and form the entire short arms of chromosomes.
Researchers do not yet fully understand why parts of the human genome are organized in this way. But gaining such an understanding will undoubtedly be valuable, because these highly repetitive sequences are often located at sites that are vital for life. Long tracts of ribosomal DNA (rDNA) repeats encode RNA components of the cell's protein-synthesis machinery and play an important role in nuclear organization15. And the repetitive DNA of structures called centromeres is essential for the proper segregation of chromosomes during cell division16.
These large tracts of repetitive DNA come with different sets of rules regarding their genomic organization and evolution. They are also subject to different epigenetic regulation (molecular modifications to DNA and associated proteins that do not alter the underlying DNA sequence), with the result that repetitive DNA differs from euchromatin in its organization, replication timing and transcriptional activity17–19. Many genome-wide tools and data sets cannot yet capture all this information from highly repetitive DNA regions, so scientists do not yet have a complete picture of which transcription factors bind to these regions, how they are spatially organized in the nucleus, or how the regulation of these parts of our genome changes during development and in disease states. Now, much as with the initial release of the genome decades ago, researchers are confronted with a new, unexplored functional landscape in the human genome. Access to this information will drive technology and innovation to include these repetitive regions, which in turn will broaden our understanding of genome biology.
In recent years, scientists have used extremely long and highly accurate sequence reads to reconstruct entire human chromosomes from telomere to telomere4,5. Last year, a nearly complete human reference genome was also released from an effectively 'haploid' human cell line, with only five remaining gaps marking the sites of rDNA arrays (go.nature.com/3rgz93y). In this line, cells have two identical copies of each chromosome, which simplifies the challenge of assembling repeats compared with typical human cells (which are diploid, with differing chromosomes inherited from the mother and father). Together, these maps provide the first high-resolution look at centromeric regions, segmental duplications, subtelomeric repeats and each of the five acrocentric chromosomes, which have very short arms consisting almost entirely of highly repetitive DNA.
It is tempting to think that scientists are finally approaching the finish line. However, a single genome assembly, even if complete and perfectly accurate, is an insufficient reference from which to study the sequence variation that exists across the human population. Existing maps that chart diversity in the euchromatic parts of the genome need to be extended to fully capture repetitive regions, where copy number and repeat organization differ between individuals. Doing so will require strategies for the routine production and analysis of complete, diploid human genomes. Reaching a more complete and comprehensive reference for humanity will undoubtedly improve our understanding of genome structure and its role in human disease, and will honour the promise and legacy of the Human Genome Project.