The Human Genome Project
by Prof. Siddharth Sanghvi
1. Introduction: The Mega Project
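The Human Genome Project (HGP, 1990-2003) aimed to sequence the entire human genome, approximately 3 × 10^9 base pairs (bp). It was called a "mega project" because of its scale and its estimated cost of 9 billion US dollars. It was made possible by genetic engineering techniques and rapid DNA sequencing methods, and it in turn spurred the development of bioinformatics. The project was coordinated by the U.S. Department of Energy and the National Institutes of Health (NIH), with international collaboration.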
2. Goals of HGP
- Identify all ~20,000-25,000 human genes.
- Determine the 3 billion base pair sequence.
- Store data in databases.
- Improve data analysis tools.
- Transfer technology to industry.
- Address ELSI (Ethical, Legal, Social Issues).
3. Methodologies
Two main approaches were used:
3.1. Expressed Sequence Tags (ESTs)
- Focus: Sequencing only DNA expressed as RNA (mRNA).
- Purpose: Quickly identify protein-coding genes.
3.2. Sequence Annotation (Blind Approach)
- Focus: Sequencing the entire genome (coding & non-coding).
- Purpose: Later assign functions to regions (Sequence Annotation). Provides a complete blueprint.
3.3. Sequencing Technology: Sanger Sequencing (Dideoxy Method)
- DNA is isolated, then cut into smaller fragments.
- Fragments are cloned (amplified) in hosts using vectors.
- Sequenced by automated DNA sequencers based on Frederick Sanger's method.
Principle (Chain Termination):
- Uses dideoxynucleotides (ddNTPs) which lack a 3'-OH group.
- ddNTPs cause chain termination when incorporated.
- Generates fragments of varying lengths, allowing sequence determination.
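The chain-termination principle can be illustrated with a small simulation. This is a minimal sketch, not a model of a real sequencer: the function names, the `dd_fraction` parameter, and the example template are assumptions made for illustration, and the template is written in the order the polymerase reads it so that the new strand comes out in the order it is synthesized.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def chain_termination_fragments(template, dd_fraction=0.2, molecules=5000, seed=0):
    """Copy `template` many times; each added base has a `dd_fraction`
    chance of being a ddNTP, which stops that chain (no 3'-OH to extend)."""
    rng = random.Random(seed)
    fragments = set()
    for _ in range(molecules):
        strand = []
        for base in template:
            strand.append(COMPLEMENT[base])
            if rng.random() < dd_fraction:  # a ddNTP was incorporated
                fragments.add("".join(strand))
                break
    return fragments

def read_sequence(fragments):
    """Sort fragments by length (as gel or capillary separation would)
    and read the terminating base of each one in turn."""
    return "".join(f[-1] for f in sorted(fragments, key=len))

template = "TACGGATCCA"  # hypothetical template, in reading order
fragments = chain_termination_fragments(template)
print(read_sequence(fragments))  # the newly synthesized (complementary) strand
```

Because every fragment of a given length ends at the same position, ordering the fragments by length and noting the last base of each recovers the sequence of the new strand.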
3.4. Vectors and Hosts
- Hosts: Bacteria and Yeast.
- Vectors: BAC (Bacterial Artificial Chromosomes) and YAC (Yeast Artificial Chromosomes) for large DNA fragments.
3.5. Assembly and Mapping
- Sequenced fragments arranged by overlapping regions using computer programs.
- Sequences assigned to chromosomes (Chromosome 1 completed last, May 2006).
- Genetic and physical maps of the genome were generated using polymorphism of restriction endonuclease recognition sites and repetitive DNA sequences (microsatellites).
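Arranging fragments by their overlapping regions can be sketched with a greedy merge. This is a toy illustration only: the HGP used specialized assembly programs, and the function names, the `min_len` threshold, and the example reads below are assumptions. It also ignores sequencing errors and repeats, which make real assembly much harder.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if the best match is shorter than `min_len`)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads

# Hypothetical overlapping reads from one stretch of DNA
print(greedy_assemble(["ATGCGTAC", "GTACGGAT", "GGATTTCA"]))
```

Reads that share no overlap of at least `min_len` bases are returned as separate contigs rather than being forced together.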
4. Salient Features of Human Genome
- Size: ~3164.7 million bases.
- Genes: ~30,000 genes (lower than expected).
- Average gene size: 3000 bases; largest is dystrophin (2.4 million bases).
- Similarity: 99.9% of nucleotide bases are exactly the same in all people.
- Unknown function: Over 50% of genes have unknown functions.
- Coding DNA: Less than 2% codes for proteins.
- Repeated sequences: Make up a large portion of the genome; they have no direct coding function but give insight into chromosome structure and evolution.
- Chromosome 1: Most genes (2968); Y chromosome: Fewest (231).
- SNPs (Single Nucleotide Polymorphisms): ~1.4 million locations where single-base differences occur, useful for finding disease-associated sequences on chromosomes and for tracing human history.
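A few of the figures above can be turned into absolute numbers with simple arithmetic. The sketch below uses only values from this list; the variable names are illustrative, and the results are rough estimates because the inputs are themselves approximations.

```python
GENOME_SIZE = 3164.7e6       # bases
IDENTICAL_FRACTION = 0.999   # share of bases identical in all people
CODING_FRACTION_MAX = 0.02   # "less than 2%" codes for proteins
SNP_LOCATIONS = 1.4e6        # single-base difference locations

# Bases that can differ between two people (the remaining 0.1%)
differing_bases = GENOME_SIZE * (1 - IDENTICAL_FRACTION)

# Upper bound on protein-coding bases
coding_bases_max = GENOME_SIZE * CODING_FRACTION_MAX

# Average spacing between SNP locations
bases_per_snp = GENOME_SIZE / SNP_LOCATIONS

print(f"Differing bases:        about {differing_bases / 1e6:.1f} million")
print(f"Coding bases (maximum): about {coding_bases_max / 1e6:.0f} million")
print(f"Average SNP spacing:    one per {bases_per_snp:,.1f} bases")
```

In round numbers, this is about 3.2 million differing bases, at most about 63 million protein-coding bases, and one SNP location roughly every 2,260 bases.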
5. Applications and Future Challenges
HGP enables new research approaches, studying all genes/transcripts simultaneously to understand complex biological networks.
- Diagnosis, treatment, and prevention of human disorders.
- Insights into non-human organisms for healthcare, agriculture, energy, environment.
6. Sequenced Non-Human Model Organisms
| Organism | Common Name | Genome Size (approx. bp) | Notes |
|---|---|---|---|
| Bacteriophage φX174 | | 5386 nucleotides | First DNA genome sequenced (single-stranded) |
| Escherichia coli (E. coli) | Bacterium | 4.6 × 10^6 | Widely used bacterial model organism |
| Saccharomyces cerevisiae | Yeast | 1.2 × 10^7 | First eukaryotic genome sequenced |
| Caenorhabditis elegans (C. elegans) | Nematode | 1.0 × 10^8 | First multicellular organism sequenced |
| Drosophila melanogaster | Fruit fly | 1.4 × 10^8 | Model for genetics |
| Arabidopsis thaliana | Thale cress (plant) | 1.35 × 10^8 | First plant genome sequenced |
| Triticum aestivum | Wheat (plant) | 1.7 × 10^10 | Large hexaploid plant genome |
| Polychaos dubium | Amoeba | ~6.7 × 10^11 | Largest genome size reported for any organism (an older estimate, considered uncertain) |
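To put the genome sizes listed above in perspective, they can be expressed relative to the human genome. This is a minimal sketch using the approximate values from the table and the human genome size given earlier (3164.7 million bases); the dictionary and function names are illustrative.

```python
HUMAN_GENOME_BP = 3164.7e6  # from the salient features of the human genome

# Approximate genome sizes in base pairs, as listed in the table
genome_sizes_bp = {
    "Bacteriophage phiX174": 5386,
    "Escherichia coli": 4.6e6,
    "Saccharomyces cerevisiae": 1.2e7,
    "Caenorhabditis elegans": 1.0e8,
    "Drosophila melanogaster": 1.4e8,
    "Arabidopsis thaliana": 1.35e8,
    "Triticum aestivum": 1.7e10,
    "Polychaos dubium": 6.7e11,
}

def relative_to_human(size_bp):
    """Genome size as a multiple of the human genome size."""
    return size_bp / HUMAN_GENOME_BP

# List organisms from smallest to largest genome
for name, size in sorted(genome_sizes_bp.items(), key=lambda item: item[1]):
    print(f"{name:26s} {size:10.3g} bp  {relative_to_human(size):9.3g} x human")
```

With these figures, the wheat genome comes out at a little over five times the size of the human genome, while the E. coli genome is well under one percent of it.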