The UniProtKB/Swiss-Prot Human Proteome Initiative

Version July 2007 (pdf)

In 2000, the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI) announced their intention to launch a major effort to annotate, describe and distribute to the life science community a large body of extensively curated information on human protein sequences. This initiative - named the Human Proteome Initiative (HPI) - is combined with an appeal to the user community to participate actively in this effort at various levels.

Once upon a time...

[Figure: Dr. Murray, OED creator]

In 1857, a group of English lexicographers and philologists met and decided to collect information on the meaning and usage of all words in the English language. This major collective effort spanned a number of decades and resulted in one of the most impressive monuments of knowledge on any given language: the Oxford English Dictionary (OED). In order to create and maintain the dictionary they gathered a team of highly qualified linguistic experts and complemented their classical approach with what was at that time an innovative concept: they made an appeal to English speakers around the world to send them citations illustrating the use of particular words and how they had evolved over time. Today we could use their original appeal as well as the description of their goal almost verbatim; all we have to do is replace "English language" with "human proteome"!

In 2004, approximately 99% of the human euchromatic genome had been accurately sequenced, and the current challenge for the scientific community has become the (re)annotation of the human genome. Four institutes - the European Bioinformatics Institute (EBI), the National Center for Biotechnology Information (NCBI), the University of California at Santa Cruz (UCSC) and the Wellcome Trust Sanger Institute (WTSI) - joined forces to create a standard set of gene annotations. To this end, they launched the Consensus CDS (CCDS) project. Such a collaborative approach, which involves sharing results obtained by different automated and manual methods, will undoubtedly be extremely fruitful.

After several years of wild guesses, a consensus has been reached and the number of human genes is currently estimated to range from 20,000 to 25,000. One of the challenges in human biology is to understand how such a relatively limited number of genes can give rise to an organism as complex as Mozart, Matisse or Marie Curie. Complexity is generated at several levels, mainly through alternative splicing and post-translational modifications (PTMs).

Largely underestimated in the past, alternative splicing appears today to be one of the most important biological events in generating complexity; indeed, it is believed that at least 40-60% of all human genes have alternatively spliced isoforms. Large-scale studies on chromosomes 21 and 22 indicate that over 80% of the genes could undergo alternative splicing.

Genomic information does not suffice to predict all the PTMs of which the majority of proteins are the target. Once synthesized on the ribosomes, proteins are subject to a multitude of PTMs. They are cleaved (thus eliminating signal sequences, transit or pro-peptides and initiator methionines); many simple chemical groups can be attached to them (acetyl, methyl, phosphoryl, etc.), as well as a number of more complex molecules, such as sugars and lipids; and finally, proteins can be internally or externally cross-linked (e.g. disulfide bonds). More than two hundred different types of PTM are currently known and many more are yet to be discovered.

HPI complexity

When combining the complexity generated by alternative splicing with that produced by PTMs, it appears that the number of different protein molecules expressed by the 20,000 to 25,000 protein-encoding genes is probably more than one million (Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).
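As a rough check on the orders of magnitude involved, the short Python sketch below computes how many distinct molecular forms each gene would have to yield, on average, for the total to reach one million. The gene counts and the one-million figure come from the text above; the function name is purely illustrative, and the even per-gene average is a simplifying assumption, since real genes differ widely in their numbers of isoforms and PTMs.

```python
def forms_per_gene_needed(total_forms: int, gene_count: int) -> float:
    """Average number of distinct protein forms per gene needed to reach total_forms."""
    if gene_count <= 0:
        raise ValueError("gene_count must be positive")
    return total_forms / gene_count


# Figures quoted in the text: one million molecules, 20,000 to 25,000 genes.
TARGET_FORMS = 1_000_000
for genes in (20_000, 25_000):
    needed = forms_per_gene_needed(TARGET_FORMS, genes)
    print(f"{genes:,} genes -> about {needed:.0f} forms per gene")
```

Under these assumptions, an average of roughly 40 to 50 distinct forms per gene (splice isoforms combined with PTM states) is enough to pass the one-million mark.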

While the considerations above concern protein complexity at the level of an individual, additional diversity factors - at the genomic level this time - have to be taken into account when dealing with the entire human population: these are polymorphisms, commonly termed "c-SNPs" (coding single nucleotide polymorphisms) which, after translation, give rise to "SAPs" (single amino-acid polymorphisms). While some of these polymorphisms are linked to disease states, the majority are not, though in many cases they can have a direct or indirect effect on the activities of the proteins.

HPI goals and means

In this context, the HPI's aim is to annotate all known human protein sequences according to the quality standards of UniProtKB/Swiss-Prot. Most UniProtKB/Swiss-Prot sequences are derived from the translation of EMBL/GenBank/DDBJ database nucleotide sequences. Sequences derived from the same gene are manually merged into a single UniProtKB/Swiss-Prot record. During this process, sequence comparison allows us to find and show the most reliable sequence. All discrepancies are carefully analyzed and stored. These can be due to alternative splicing, polymorphism, or unknown reasons such as sequencing errors or as yet uncharacterized polymorphisms. Currently, an average of about 6 nucleotide entries is used to create one human UniProtKB/Swiss-Prot entry and this number is growing continuously. These sequences can be further - fully or partially - confirmed by direct protein sequencing, either by the classical Edman sequencing technique or by mass spectrometry methods. Currently, more than 15% of the human entries contain such data.
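To make the idea of comparing several translations of the same gene more concrete, here is a toy Python sketch. It is not the procedure used by the UniProtKB/Swiss-Prot curators, whose merging is manual and relies on expert judgment; the function name, the example sequences and the simple majority vote per position are all assumptions made for illustration, and the sequences are assumed to be already aligned and of equal length.

```python
from collections import Counter


def compare_translations(sequences):
    """Return (consensus, discrepancies) for equal-length protein sequences.

    discrepancies maps each 1-based position where the sequences disagree
    to the residue counts observed at that position.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("this toy example expects pre-aligned, equal-length sequences")

    consensus = []
    discrepancies = {}
    for pos, residues in enumerate(zip(*sequences), start=1):
        counts = Counter(residues)
        # Hypothetical rule: keep the most frequently observed residue.
        consensus.append(counts.most_common(1)[0][0])
        if len(counts) > 1:
            discrepancies[pos] = dict(counts)
    return "".join(consensus), discrepancies


# Invented translations of three nucleotide entries for the same gene.
translations = ["MKTAYIAK", "MKTAYIAK", "MKTVYIAK"]
consensus, conflicts = compare_translations(translations)
print(consensus)
print(conflicts)
```

In this invented example the three translations disagree only at position 4, which is the kind of discrepancy that would then need to be examined and attributed to a polymorphism, a splicing difference or a sequencing error.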

HPI sequence variety

In addition to accurate sequences, UniProtKB/Swiss-Prot manual annotation strives to provide, for each known protein, a wealth of information that includes the description of its function, domain structure, subcellular location, post-translational modifications, variants, similarities to other proteins, etc. This involves not only a critical examination of computer predictions obtained with constantly improving bioinformatics tools but also the careful review of the scientific literature.

The HPI project contains a number of sub-components, which are briefly described below:

  • Annotation of all known human proteins. We plan to finish the manual annotation of the human proteome by September 2008. While most of these sequences are in the UniProtKB/TrEMBL computer-annotated supplement, some may not appear in any sequence database, either because the coding sequence has not been annotated as such in the DNA databases or because the sequence has not been submitted. In order to obtain a complete human protein dataset, the HGNC-linked set of entries will be complemented with Ensembl gene predictions. It should be noted that some sequences, such as most non-germline immunoglobulins and T-cell receptors, are excluded. We also review and update the annotation of the human sequences currently in UniProtKB/Swiss-Prot.
  • Annotation of mammalian orthologs of human proteins. As we annotate human proteins, we check that orthologs in other mammalian species are also annotated at a level equivalent to that of the cognate human sequences. Currently, the most represented nonprimate mammalian species are mouse (about 13,400 entries), rat (about 6,200 entries), bovine (about 4,100 entries), pig (about 1,200 entries), rabbit (about 840 entries) and dog (about 700 entries). High-throughput cDNA sequencing projects for nonhuman primates are being developed and they are providing new sequences, which are being continuously integrated into UniProtKB/Swiss-Prot. Although not yet impressive, the number of primate entries is rapidly growing, concomitantly with the nucleotide submissions to the EMBL/GenBank/DDBJ databases. In June 2007 (release 53.2), there were about 1,800 orangutan, 900 crab-eating macaque and 610 chimpanzee entries.
  • Annotation of all known human polymorphisms at the protein sequence level. UniProtKB/Swiss-Prot already holds information on a sizeable number of SAPs, and it has significantly expanded its effort to store and annotate all 'small' variations at the protein level. Specific web pages have been created for each human sequence variant. Each page displays a synopsis of the information known for a given variant and includes, if it can be computed, a structural model of the variant. Mutations that cause major changes to a protein sequence (as is the case for most frameshift mutations) are not and will not be considered relevant to UniProtKB/Swiss-Prot, as their deleterious effects on a given protein's function are usually obvious.
  • Annotation of all known post-translational modifications in human proteins. As they are difficult to predict, most post-translational modifications (PTMs) described in UniProtKB/Swiss-Prot come from experimental data. Following strict rules, such data can be transferred "by similarity" to orthologous proteins or to other members of the same protein family. We live in "high-throughput" times and the field of PTMs is no exception. High-throughput projects have been developed in order to identify human proteins subjected to specific PTMs - for the time being mainly phosphorylation, glycosylation, sumoylation and ubiquitination have been investigated - and to compare physiological and pathological states in this regard. The data they produce are being continuously integrated into UniProtKB/Swiss-Prot, as are those dealing with N-terminal acetylation, which is readily detected by mass spectrometry.
  • Tight links to structural information. UniProtKB/Swiss-Prot is tightly linked to the PDB/RCSB 3D-structure database and includes many features useful to structural biologists, such as literature references concerning X-ray and NMR papers and DSSP-derived secondary structure information. As fewer than 15% of all human proteins have been characterized at the level of their 3D-structure, it is important to expand the scope of experimentally-derived structural information by providing homology-derived models for all human proteins for which such an approach is scientifically relevant. This is done through links to the HSSP and SMR databases.

We need you

For all aspects of the HPI project, we would appreciate the help and collaboration of the scientific community. Information regarding the human proteome is highly critical to a large section of the life science community. We therefore greatly encourage the user community to fully participate in this initiative by providing information not only to help the comprehensive annotation of the human proteome but also to speed it up.

The HPI project is a long-term challenge. It will take years to annotate and periodically re-annotate all human proteins so as to obtain a full and useful compendium which will describe the function and, more specifically, the role of these crucial actors involved in most, if not all, biological processes.

"May you live in interesting times!" is supposedly a proverb used by the Chinese in Antiquity, which was, however, less a blessing than a curse... There is no doubt that the life science community is living in interesting times; it would be agreeable to believe that this is not a curse, but clearly a blessing.
