UniProt Knowledgebase
Swiss-Prot Protein Knowledgebase
TrEMBL Protein Database

Release notes
UniProt release 4.0 of 1-Feb-2005

Content

  Introduction
  UniProt/Swiss-Prot Protein Knowledgebase release statistics
  UniProt/TrEMBL Protein Database release statistics

  Submissions and Updates
  Download information
  Contact
  Citation

  Related documents: UniProt user manual, Recent changes, Forthcoming changes.

Introduction

Release 4.0 of the UniProt Knowledgebase is composed of the UniProt/Swiss-Prot Protein Knowledgebase release 46.0 and the UniProt/TrEMBL Protein Database release 29.0.

More information on these databases can be found in the user manual section "What is the UniProt Knowledgebase?".


UniProt/Swiss-Prot protein knowledgebase release 46.0 statistics

Release 46.0 of 01-Feb-2005 of UniProt/Swiss-Prot contains 168'297 sequence entries, comprising 61'443'278 amino acids abstracted from 124'910 references.

The growth of the database is summarized below.

Release  Date   Number of entries  Number of amino acids
-------  -----  -----------------  ---------------------
2.0      09/86              3'939                900'163
3.0      11/86              4'160                969'641
4.0      04/87              4'387              1'036'010
5.0      09/87              5'205              1'327'683
6.0      01/88              6'102              1'653'982
7.0      04/88              6'821              1'885'771
8.0      08/88              7'724              2'224'465
9.0      11/88              8'702              2'498'140
10.0     03/89             10'008              2'952'613
11.0     07/89             10'856              3'265'966
12.0     10/89             12'305              3'797'482
13.0     01/90             13'837              4'347'336
14.0     04/90             15'409              4'914'264
15.0     08/90             16'941              5'486'399
16.0     11/90             18'364              5'986'949
17.0     02/91             20'024              6'524'504
18.0     05/91             20'772              6'792'034
19.0     08/91             21'795              7'173'785
20.0     11/91             22'654              7'500'130
21.0     03/92             23'742              7'866'596
22.0     05/92             25'044              8'375'696
23.0     08/92             26'706              9'011'391
24.0     12/92             28'154              9'545'427
25.0     04/93             29'955             10'214'020
26.0     07/93             31'808             10'875'091
27.0     10/93             33'329             11'484'420
28.0     02/94             36'000             12'496'420
29.0     06/94             38'303             13'464'008
30.0     10/94             40'292             14'147'368
31.0     02/95             43'470             15'335'248
32.0     11/95             49'340             17'385'503
33.0     02/96             52'205             18'531'384
34.0     10/96             59'021             21'210'389
35.0     11/97             69'113             25'083'768
36.0     07/98             74'019             26'840'295
37.0     12/98             77'977             28'268'293
38.0     07/99             80'000             29'085'965
39.0     05/00             86'593             31'411'114
40.0     10/01            101'602             37'315'215
41.0     02/03            122'564             44'986'459
42.0     10/03            135'850             50'046'799
43.0     03/04            146'720             54'093'154
44.0     07/04            153'871             56'608'159
45.0     10/04            163'235             59'631'787
46.0     02/05            168'297             61'443'278
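
The counts in this table use an apostrophe as thousands separator. As a
minimal sketch, the last two rows can be parsed and compared as shown here.
The helper names parse_count and parse_row are illustrative and are not part
of any UniProt tool.

```python
def parse_count(text: str) -> int:
    """Convert a count such as 61'443'278 to an integer."""
    return int(text.replace("'", ""))


def parse_row(line: str) -> tuple[str, str, int, int]:
    """Split one table row into release, date, entries and amino acids."""
    release, date, entries, residues = line.split()
    return release, date, parse_count(entries), parse_count(residues)


# Last two rows of the growth table.
previous = parse_row("45.0 10/04 163'235 59'631'787")
current = parse_row("46.0 02/05 168'297 61'443'278")

# Net change in the number of entries between the two releases.
net_entries = current[2] - previous[2]
net_percent = 100 * net_entries / previous[2]
print(f"{previous[0]} -> {current[0]}: +{net_entries} entries ({net_percent:.1f}%)")
```

Because parse_row splits on whitespace, it works whether the columns are
separated by one space or padded for alignment.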

In rare cases, Swiss-Prot entries are removed. Deleted entries are almost exclusively Open Reading Frames (ORFs) that have been wrongly predicted to code for proteins. When there is enough evidence that these hypothetical proteins are not real, we remove them from Swiss-Prot. The document delac_sp.txt lists all accession numbers that were previously present in UniProt/Swiss-Prot but have now been deleted from the database.


Status of the model organisms

We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to provide dedicated database cross-references and index files.

From our efforts to annotate human sequence entries as completely as possible arose the HPI project, and the bacterial model organisms became the focus of the HAMAP project. Here is the current status of the model organisms which are not covered by these two projects:

Organism        Database cross-references  Index file    Number of sequences
--------------  -------------------------  ------------  -------------------
A.thaliana      None yet                   arath.txt                   3'110
C.albicans      None yet                   calbican.txt                  321
C.elegans       Wormpep                    celegans.txt                2'615
D.discoideum    DictyBase                  dicty.txt                     324
D.melanogaster  FlyBase                    fly.txt                     2'158
M.musculus      MGD                        mgdtosp.txt                 8'676
S.cerevisiae    SGD                        yeast.txt                   5'042
S.pombe         GeneDB_SPombe              pombe.txt                   2'712

UniProt/Swiss-Prot release statistics

1.  INTRODUCTION

Release 46.0 of 01-Feb-2005 of UniProt/Swiss-Prot contains 168297 sequence entries,
comprising 61443278 amino acids abstracted from 124910 references. 

4537 sequences have been added since release 45; the sequence data of
866 existing entries have been updated, and the annotations of
77494 entries have been revised. This represents an increase of 3%
in the number of entries.


2.  AMINO ACID COMPOSITION

   2.1  Composition in percent for the complete database

   Ala (A) 7.81   Gln (Q) 3.94   Leu (L) 9.62   Ser (S) 6.88
   Arg (R) 5.32   Glu (E) 6.61   Lys (K) 5.93   Thr (T) 5.45
   Asn (N) 4.20   Gly (G) 6.93   Met (M) 2.37   Trp (W) 1.15
   Asp (D) 5.30   His (H) 2.28   Phe (F) 4.00   Tyr (Y) 3.07
   Cys (C) 1.56   Ile (I) 5.91   Pro (P) 4.84   Val (V) 6.71

   Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.01


   2.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Lys, Ile, Thr, Arg, Asp, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
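
The classification in 2.2 follows directly from the percentages in 2.1. A
minimal sketch that sorts the twenty standard residues by decreasing
percentage (the dictionary name composition is illustrative):

```python
# Percentages from section 2.1 (Swiss-Prot release 46.0).
composition = {
    "Ala": 7.81, "Gln": 3.94, "Leu": 9.62, "Ser": 6.88,
    "Arg": 5.32, "Glu": 6.61, "Lys": 5.93, "Thr": 5.45,
    "Asn": 4.20, "Gly": 6.93, "Met": 2.37, "Trp": 1.15,
    "Asp": 5.30, "His": 2.28, "Phe": 4.00, "Tyr": 3.07,
    "Cys": 1.56, "Ile": 5.91, "Pro": 4.84, "Val": 6.71,
}

# Sort residue names by decreasing percentage.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```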


3.  TAXONOMIC ORIGIN

   Total number of species represented in this release of Swiss-Prot: 8826

   The first twenty species represent 62418 sequences:  37.1 % of the total
   number of entries.


   3.1 Table of the frequency of occurrence of species

        Species represented 1x: 4171
                            2x: 1390
                            3x:  699
                            4x:  460
                            5x:  289
                            6x:  265
                            7x:  195
                            8x:  155
                            9x:  129
                           10x:   83
                       11- 20x:  371
                       21- 50x:  293
                       51-100x:   96
                         >100x:  230
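
As a consistency check, the classes of table 3.1 should add up to the total
number of species given at the start of this section. A short sketch, with
the class labels used as illustrative dictionary keys:

```python
# Number of species per occurrence class, from table 3.1.
species_by_class = {
    "1x": 4171, "2x": 1390, "3x": 699, "4x": 460, "5x": 289,
    "6x": 265, "7x": 195, "8x": 155, "9x": 129, "10x": 83,
    "11-20x": 371, "21-50x": 293, "51-100x": 96, ">100x": 230,
}

# The sum over all classes is the number of species represented.
total_species = sum(species_by_class.values())
print(total_species)
```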


   3.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1      11850  Homo sapiens (Human)
       2       8676  Mus musculus (Mouse)
       3       5042  Saccharomyces cerevisiae (Baker's yeast)
       4       4838  Escherichia coli
       5       4079  Rattus norvegicus (Rat)
       6       3110  Arabidopsis thaliana (Mouse-ear cress)
       7       2767  Bacillus subtilis
       8       2712  Schizosaccharomyces pombe (Fission yeast)
       9       2615  Caenorhabditis elegans
      10       2158  Drosophila melanogaster (Fruit fly)
      11       1782  Methanococcus jannaschii
      12       1773  Haemophilus influenzae
      13       1707  Escherichia coli O157:H7
      14       1521  Bos taurus (Bovine)
      15       1468  Salmonella typhimurium
      16       1399  Mycobacterium tuberculosis
      17       1368  Escherichia coli O6
      18       1328  Shigella flexneri
      19       1128  Gallus gallus (Chicken)
      20       1097  Mycobacterium bovis
      21       1051  Salmonella typhi
      22       1012  Pseudomonas aeruginosa
      23        958  Synechocystis sp. (strain PCC 6803)
      24        955  Archaeoglobus fulgidus
      25        923  Sus scrofa (Pig)
      26        908  Xenopus laevis (African clawed frog)
      27        807  Rhizobium meliloti (Sinorhizobium meliloti)
      28        792  Vibrio cholerae
      29        766  Yersinia pestis
      30        747  Oryctolagus cuniculus (Rabbit)
      31        745  Aquifex aeolicus
      32        687  Mycoplasma pneumoniae
      33        681  Pasteurella multocida
      34        629  Vibrio parahaemolyticus
      35        628  Streptomyces coelicolor
      36        617  Bacillus halodurans
      37        612  Mycobacterium leprae
      38        606  Treponema pallidum
      39        578  Vibrio vulnificus
      40        573  Methanobacterium thermoautotrophicum
      41        572  Buchnera aphidicola (subsp. Acyrthosiphon pisum) 
      42        568  Anabaena sp. (strain PCC 7120)
      43        562  Helicobacter pylori (Campylobacter pylori)
      44        561  Buchnera aphidicola (subsp. Schizaphis graminum)
      45        549  Staphylococcus aureus (strain Mu50 / ATCC 700699)
      46        547  Staphylococcus aureus (strain N315)
      47        546  Rickettsia prowazekii
      48        543  Helicobacter pylori J99 (Campylobacter pylori J99)
      49        530  Staphylococcus aureus (strain MW2)
      50        517  Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
      51        514  Pseudomonas putida (strain KT2440)
      52        513  Zea mays (Maize)
      53        508  Pseudomonas syringae (pv. tomato)
      54        507  Buchnera aphidicola (subsp. Baizongia pistaciae)
      55        499  Staphylococcus epidermidis
      56        499  Agrobacterium tumefaciens (strain C58 / ATCC 33970)
      57        499  Ralstonia solanacearum (Pseudomonas solanacearum)
      58        496  Listeria monocytogenes
      59        492  Listeria innocua
      60        486  Mycoplasma genitalium
      61        486  Rhizobium loti (Mesorhizobium loti)
      62        482  Xanthomonas campestris (pv. campestris)
      63        481  Neisseria meningitidis (serogroup B)
      64        479  Neisseria meningitidis (serogroup A)
      65        472  Clostridium acetobutylicum
      66        467  Bradyrhizobium japonicum
      67        464  Bacillus anthracis
      68        463  Caulobacter crescentus
      69        462  Canis familiaris (Dog)
      70        461  Thermotoga maritima
      71        444  Xanthomonas axonopodis (pv. citri)
      72        442  Streptococcus pneumoniae
      73        438  Oryza sativa (Rice)
      74        438  Xylella fastidiosa
      75        432  Deinococcus radiodurans
      76        428  Pyrococcus horikoshii
      77        428  Chlamydia trachomatis
      78        426  Xylella fastidiosa (strain Temecula1 / ATCC 700964)
      79        424  Pyrococcus abyssi
      80        419  Shewanella oneidensis
      81        417  Borrelia burgdorferi (Lyme disease spirochete)
      82        411  Brucella melitensis
      83        411  Brucella suis
      84        410  Methanosarcina acetivorans
      85        410  Chlamydia pneumoniae (Chlamydophila pneumoniae)
      86        410  Clostridium perfringens
      87        405  Vibrio vulnificus (strain YJ016)
      88        403  Rhizobium sp. (strain NGR234)
      89        400  Chlamydia muridarum
      90        396  Corynebacterium glutamicum (Brevibacterium flavum)
      91        395  Methanosarcina mazei (Methanosarcina frisia)
      92        394  Halobacterium sp. (strain NRC-1 / ATCC 700922 / JCM 11081)
      93        394  Bacillus cereus (strain ATCC 14579 / DSM 31)
      94        393  Brachydanio rerio (Zebrafish) (Danio rerio)
      95        384  Pyrococcus furiosus
      96        380  Oceanobacillus iheyensis
      97        378  Campylobacter jejuni
      98        378  Sulfolobus solfataricus
      99        377  Thermoanaerobacter tengcongensis
     100        372  Photorhabdus luminescens (subsp. laumondii)
     101        372  Neurospora crassa
     102        371  Ovis aries (Sheep)
     103        371  Lactobacillus plantarum
     104        366  Nicotiana tabacum (Common tobacco)
     105        365  Streptococcus pyogenes
     106        360  Streptococcus pneumoniae (strain ATCC BAA-255 / R6)
     107        359  Rickettsia conorii
     108        348  Synechococcus elongatus (Thermosynechococcus elongatus)
     109        344  Streptococcus mutans
     110        335  Aeropyrum pernix
     111        331  Chlorobium tepidum
     112        324  Dictyostelium discoideum (Slime mold)
     113        322  Streptococcus pyogenes (serotype M18)
     114        321  Candida albicans (Yeast)
     115        317  Streptococcus pyogenes (serotype M3)
     116        314  Methanopyrus kandleri
     117        313  Staphylococcus aureus
     118        307  Enterococcus faecalis (Streptococcus faecalis)
     119        304  Pan troglodytes (Chimpanzee)
     120        303  Sulfolobus tokodaii
     121        302  Pisum sativum (Garden pea)
     122        293  Bordetella bronchiseptica (Alcaligenes bronchisepticus)
     123        292  Bordetella pertussis
     124        290  Thermoplasma acidophilum
     125        288  Haemophilus ducreyi
     126        283  Corynebacterium efficiens
     127        283  Triticum aestivum (Wheat)
     128        282  Bordetella parapertussis
     129        279  Streptomyces avermitilis
     130        278  Staphylococcus aureus (strain MRSA252)
     131        277  Staphylococcus aureus (strain MSSA476)
     132        276  Chromobacterium violaceum
     133        273  Fusobacterium nucleatum (subsp. nucleatum)
     134        272  Hordeum vulgare (Barley)
     135        268  Bacteriophage T4
     136        266  Nitrosomonas europaea
     137        264  Glycine max (Soybean)
     138        261  Lycopersicon esculentum (Tomato)
     139        261  Streptococcus agalactiae (serotype V)
     140        259  Streptococcus agalactiae (serotype III)
     141        258  Leptospira interrogans
     142        257  Cavia porcellus (Guinea pig)
     143        256  Solanum tuberosum (Potato)
     144        255  Thermoplasma volcanium
     145        254  Rhodobacter capsulatus (Rhodopseudomonas capsulata)
     146        254  Vaccinia virus (strain Copenhagen) (VACV)
     147        254  Pyrobaculum aerophilum
     148        248  Pseudomonas putida
     149        240  Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
     150        238  Spinacia oleracea (Spinach)
     151        233  Bacillus stearothermophilus
     152        221  Clostridium tetani
     153        221  Wigglesworthia glossinidia brevipalpis
     154        220  Porphyra purpurea
     155        220  Chlamydophila caviae
     156        218  Coxiella burnetii
     157        218  Gloeobacter violaceus
     158        216  Synechococcus sp. (strain WH8102)
     159        212  Kluyveromyces lactis (Yeast)
     160        212  Chlamydomonas reinhardtii
     161        210  Prochlorococcus marinus
     162        210  Bacteroides thetaiotaomicron
     163        209  Macaca mulatta (Rhesus macaque)
     164        208  Equus caballus (Horse)
     165        207  Prochlorococcus marinus (strain MIT 9313)
     166        206  Klebsiella pneumoniae
     167        204  Macaca fascicularis (Crab eating macaque) (Cynomolgus monkey)
     168        200  Vaccinia virus (strain Western Reserve / WR) (VACV)
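
The figure quoted for the first twenty species can be recomputed from the
first twenty rows of table 3.2. A minimal sketch (variable names are
illustrative):

```python
# Frequencies of the twenty most represented species, from table 3.2.
top20 = [
    11850, 8676, 5042, 4838, 4079, 3110, 2767, 2712, 2615, 2158,
    1782, 1773, 1707, 1521, 1468, 1399, 1368, 1328, 1128, 1097,
]
TOTAL_ENTRIES = 168297  # Swiss-Prot release 46.0

top20_total = sum(top20)
share = 100 * top20_total / TOTAL_ENTRIES
print(f"{top20_total} sequences: {share:.1f} % of the total number of entries")
```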


   3.3  Taxonomic distribution of the sequences

   Kingdom        sequences (% of the database)
    Archaea            9025 (  5%)
    Bacteria          73807 ( 44%)
    Eukaryota         76388 ( 45%)
    Viruses            9077 (  5%)


   Within Eukaryota:

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                  11850 ( 16%)           (  7%)
     Other Mammalia         21659 ( 28%)           ( 13%)
     Other Vertebrata        7019 (  9%)           (  4%)
     Viridiplantae          11826 ( 15%)           (  7%)
     Fungi                  11327 ( 15%)           (  7%)
     Insecta                 4177 (  5%)           (  2%)
     Nematoda                2880 (  4%)           (  2%)
     Other                   5650 (  7%)           (  3%)
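
The percentages in 3.3 are the sequence counts divided by the relevant
total, rounded to the nearest integer. A sketch under that assumption (the
percent helper is illustrative):

```python
TOTAL_ENTRIES = 168297  # Swiss-Prot release 46.0

kingdoms = {
    "Archaea": 9025, "Bacteria": 73807, "Eukaryota": 76388, "Viruses": 9077,
}
eukaryota = {
    "Human": 11850, "Other Mammalia": 21659, "Other Vertebrata": 7019,
    "Viridiplantae": 11826, "Fungi": 11327, "Insecta": 4177,
    "Nematoda": 2880, "Other": 5650,
}


def percent(part: int, whole: int) -> int:
    """Share of `part` in `whole`, rounded to the nearest integer percent."""
    return round(100 * part / whole)


for name, count in kingdoms.items():
    print(f"{name:<18}{count:>6} ({percent(count, TOTAL_ENTRIES):>3}%)")

# Within Eukaryota: share of the kingdom, then share of the whole database.
for name, count in eukaryota.items():
    of_kingdom = percent(count, kingdoms["Eukaryota"])
    of_database = percent(count, TOTAL_ENTRIES)
    print(f"{name:<18}{count:>6} ({of_kingdom:>3}%) ({of_database:>3}%)")
```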


4.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50    3303             1001-1100     1432
                 51- 100   11821             1101-1200     1035
                101- 150   17104             1201-1300      739
                151- 200   15970             1301-1400      552
                201- 250   16646             1401-1500      438
                251- 300   14263             1501-1600      277
                301- 350   15036             1601-1700      209
                351- 400   13286             1701-1800      158
                401- 450   10277             1801-1900      173
                451- 500    8760             1901-2000      140
                501- 550    6626             2001-2100       84
                551- 600    4573             2101-2200      127
                601- 650    3841             2201-2300      115
                651- 700    2671             2301-2400       71
                701- 750    2259             2401-2500       63
                751- 800    1926             >2500          445
                801- 850    1541
                851- 900    1697
                901- 950    1183
                951-1000     999


   The average sequence length in Swiss-Prot is 365 amino acids.

   The shortest sequence is   GWA_SEPOF (P83570):     2 amino acids.
   The longest sequence is  SYNE1_HUMAN (Q8NF91):  8797 amino acids.
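
The figures in this section can be cross-checked against the totals given
elsewhere in these notes. The sketch below assumes that the average length
is taken over all entries, fragments included, and that the size table
covers every entry that is not a fragment. Variable names are illustrative.

```python
# Counts from the size table: left column first, then right column.
size_counts = [
    3303, 11821, 17104, 15970, 16646, 14263, 15036, 13286, 10277, 8760,
    6626, 4573, 3841, 2671, 2259, 1926, 1541, 1697, 1183, 999,
    1432, 1035, 739, 552, 438, 277, 209, 158, 173, 140,
    84, 127, 115, 71, 63, 445,
]

total_entries = 168297     # section 1
total_residues = 61443278  # section 1
fragments = 8457           # section 7

# Entries counted in the size table versus entries that are not fragments.
complete_sequences = sum(size_counts)
print(complete_sequences, total_entries - fragments)

# Average sequence length over all entries.
average_length = total_residues / total_entries
print(round(average_length))
```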


5.  JOURNAL CITATIONS

   Note: the following citation statistics reflect the number of distinct
         journal citations.

   Total number of journals cited in this release of Swiss-Prot: 1551


   5.1 Table of the frequency of journal citations

        Journals cited 1x:  567
                       2x:  212
                       3x:  102
                       4x:   68
                       5x:   62
                       6x:   34
                       7x:   33
                       8x:   30
                       9x:   20
                      10x:   17
                  11- 20x:  118
                  21- 50x:  123
                  51-100x:   55
                    >100x:  110


   5.2  List of the most cited journals in Swiss-Prot

   Nb    Citations   Journal name
   --    ---------   -------------------------------------------------------------
    1        11442   Journal of Biological Chemistry
    2         5878   Proceedings of the National Academy of Sciences of the U.S.A.
    3         4050   Journal of Bacteriology
    4         3813   Nucleic Acids Research
    5         3789   Gene
    6         3152   Biochemical and Biophysical Research Communications
    7         3125   FEBS Letters
    8         2802   Biochemistry
    9         2751   European Journal of Biochemistry
   10         2612   The EMBO Journal
   11         2403   Nature
   12         2358   Biochimica et Biophysica Acta
   13         2134   Journal of Molecular Biology
   14         2031   Genomics
   15         1927   Molecular and Cellular Biology
   16         1912   Cell
   17         1542   Biochemical Journal
   18         1422   Science
   19         1268   Molecular Microbiology
   20         1216   Plant Molecular Biology
   21         1209   Molecular and General Genetics
   22          980   Journal of Biochemistry
   23          936   Journal of Cell Biology
   24          914   Virology
   25          910   Human Molecular Genetics
   26          838   Nature Genetics
   27          762   Genes and Development
   28          751   Journal of Virology
   29          722   The American Journal of Human Genetics
   30          714   Oncogene
   31          687   Plant Physiology
   32          683   Human Mutation
   33          631   Journal of Immunology
   34          620   Infection and Immunity
   35          612   Archives of Biochemistry and Biophysics
   36          601   Yeast
   37          587   Structure
   38          553   Journal of General Virology
   39          538   Development
   40          529   Microbiology
   41          505   FEMS Microbiology Letters
   42          489   Genetics
   43          480   Nature Structural Biology
   44          442   Human Genetics
   45          441   Blood
   46          427   Current Genetics
   47          386   Molecular and Biochemical Parasitology
   48          375   Applied and Environmental Microbiology
   49          361   Journal of Clinical Investigation
   50          350   Developmental Biology
   51          348   Mammalian Genome
   52          346   Molecular Endocrinology
   53          344   Protein Science
   54          340   Cancer Research
   55          338   Molecular Biology of the Cell
   56          330   Immunogenetics
   57          326   The Plant Cell
   58          324   Acta Crystallographica, Section D
   59          321   Mechanisms of Development
   60          319   Neuron
   61          314   The Journal of Experimental Medicine
   62          312   Journal of Molecular Evolution
   63          307   DNA and Cell Biology
   64          306   Journal of Cell Science
   65          282   Biological Chemistry Hoppe-Seyler
   66          277   Journal of Neuroscience
   67          277   The Plant Journal
   68          276   Endocrinology
   69          268   DNA Sequence
   70          254   Journal of Neurochemistry
   71          243   Molecular Cell
   72          239   Journal of General Microbiology
   73          237   Brain Research. Molecular Brain Research
   74          236   Molecular Biology and Evolution
   75          235   The Journal of Clinical Endocrinology and Metabolism
   76          225   Toxicon
   77          218   Current Biology
   78          217   Bioscience, Biotechnology, and Biochemistry
   79          214   Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
   80          212   American Journal of Physiology
   81          210   Cytogenetics and Cell Genetics
   82          205   Comparative Biochemistry and Physiology
   83          186   Molecular Pharmacology
   84          180   Antimicrobial Agents and Chemotherapy
   85          164   Proteins
   86          159   Journal of Investigative Dermatology
   87          158   DNA
   88          156   Journal of Medical Genetics
   89          154   DNA Research
   90          151   Peptides
   91          149   Tissue Antigens
   92          146   Molecular Plant-Microbe Interactions
   93          146   Genome Research
   94          146   Virus Research
   95          143   American Journal of Medical Genetics
   96          141   Biochimie
   97          138   Bioorganicheskaia Khimiia
   98          135   Hemoglobin
   99          130   European Journal of Immunology
  100          129   Molecular and Cellular Endocrinology
  101          126   Biology of Reproduction
  102          123   Plant and Cell Physiology
  103          116   Agricultural and Biological Chemistry
  104          115   Insect Biochemistry and Molecular Biology
  105          109   Archives of Microbiology
  106          105   General and Comparative Endocrinology
  107          105   Annals of Neurology
  108          103   Diabetes
  109          101   European Journal of Human Genetics
  110          101   Molecular Phylogenetics and Evolution


6.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some Swiss-Prot lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                     331500              1.97
   Journal                          295405    158585    1.76
   Submitted to EMBL/GenBank/DDBJ    33350     28547    0.20
   Submitted to Swiss-Prot             619       616   <0.01
   Plant Gene Register                 495       484   <0.01
   Book citation                       483       471   <0.01
   Unpublished observations            444       440   <0.01
   Thesis                              280       278   <0.01
   Submitted to other databases        217       214   <0.01
   Patent                              118       116   <0.01
   Unpublished results                  83        81   <0.01
   Worm Breeder's Gazette                6         6   <0.01

Comments (CC)                       610556              3.63
   SIMILARITY                       174573    149217    1.04
   FUNCTION                         110731    108250    0.66
   SUBCELLULAR LOCATION              81853     81853    0.49
   CATALYTIC ACTIVITY                59345     55679    0.35
   SUBUNIT                           53482     53481    0.32
   PATHWAY                           28898     27301    0.17
   COFACTOR                          20107     20107    0.12
   TISSUE SPECIFICITY                18762     18762    0.11
   PTM                               11372     10101    0.07
   MISCELLANEOUS                      9674      8890    0.06
   DOMAIN                             6951      6128    0.04
   ALTERNATIVE PRODUCTS               6544      6544    0.04
   CAUTION                            5775      5209    0.03
   INDUCTION                          4721      4721    0.03
   DEVELOPMENTAL STAGE                4413      4413    0.03
   DISEASE                            2843      2087    0.02
   INTERACTION                        2606      2606    0.02
   ENZYME REGULATION                  2397      2397    0.01
   MASS SPECTROMETRY                  1600      1406    0.01
   DATABASE                           1481      1399    0.01
   BIOPHYSICOCHEMICAL PROPERTIES       793       793   <0.01
   POLYMORPHISM                        496       484   <0.01
   ALLERGEN                            375       375   <0.01
   RNA EDITING                         340       340   <0.01
   TOXIC DOSE                          263       262   <0.01
   BIOTECHNOLOGY                       110       110   <0.01
   PHARMACEUTICAL                       51        51   <0.01

Features (FT)                       951134              5.65
   DOMAIN                           137509     42734    0.82
   TRANSMEM                         106696     23186    0.63
   CONFLICT                          64076     22398    0.38
   METAL                             63755     15800    0.38
   TURN                              62445      4663    0.37
   STRAND                            57248      4166    0.34
   CARBOHYD                          56975     14081    0.34
   DISULFID                          52591     13918    0.31
   HELIX                             45087      4520    0.27
   ACT_SITE                          38281     22904    0.23
   REPEAT                            36216      5152    0.22
   VARIANT                           31599      6000    0.19
   CHAIN                             28442     23157    0.17
   NP_BIND                           23975     16553    0.14
   MOD_RES                           19066     10178    0.11
   SIGNAL                            18062     18060    0.11
   SITE                              15265      9051    0.09
   BINDING                           14746      9725    0.09
   VARSPLIC                          13053      5755    0.08
   ZN_FING                           10948      4044    0.07
   NON_TER                           10907      8300    0.06
   MUTAGEN                            9579      2606    0.06
   INIT_MET                           7510      7464    0.04
   PROPEP                             5846      4942    0.03
   DNA_BIND                           5179      4872    0.03
   LIPID                              5121      3374    0.03
   PEPTIDE                            3563      1599    0.02
   TRANSIT                            3059      3032    0.02
   CA_BIND                            2236       902    0.01
   NON_CONS                           1008       495    0.01
   CROSSLNK                            517       408   <0.01
   UNSURE                              383       156   <0.01
   SE_CYS                              191       134   <0.01

Cross-references (DR)              1666608              9.90
   InterPro                         341849    151755    2.03
   EMBL                             327282    160878    1.94
   Pfam                             196363    144251    1.17
   PROSITE                          150504     93796    0.89
   PIR                               91827     84791    0.55
   GO                                75177     21332    0.45
   HSSP                              69476     69476    0.41
   PRINTS                            60403     49140    0.36
   TIGRFAMs                          52285     48770    0.31
   HAMAP                             50708     50601    0.30
   ProDom                            45407     43563    0.27
   SMART                             41802     31654    0.25
   PDB                               24775      6745    0.15
   Ensembl                           22719     22718    0.13
   TIGR                              16617     16155    0.10
   Genew                             10935     10875    0.06
   MIM                               10379      8553    0.06
   MGD                                8327      8284    0.05
   IntAct                             7447      7447    0.04
   SGD                                5092      5031    0.03
   PIRSF                              5008      5001    0.03
   GermOnline                         4927      4877    0.03
   EcoGene                            4225      4223    0.03
   EchoBASE                           4159      4127    0.02
   H-InvDB                            3677      3659    0.02
   MEROPS                             3598      3507    0.02
   WormPep                            2990      2612    0.02
   RGD                                2886      2883    0.02
   FlyBase                            2747      2723    0.02
   GeneDB_SPombe                      2740      2710    0.02
   TRANSFAC                           2737      2455    0.02
   SubtiList                          2717      2716    0.02
   WormBase                           2672      2597    0.02
   TubercuList                        1427      1391    0.01
   StyGene                            1420      1417    0.01
   SWISS-2DPAGE                       1121      1121    0.01
   ListiList                           989       966    0.01
   Reactome                            717       717   <0.01
   GeneFarm                            625       624   <0.01
   Leproma                             616       612   <0.01
   Gramene                             569       564   <0.01
   MaizeDB                             419       414   <0.01
   ZFIN                                387       380   <0.01
   PhotoList                           372       372   <0.01
   HIV                                 370       354   <0.01
   REBASE                              366       361   <0.01
   OGP                                 364       364   <0.01
   ECO2DBASE                           351       299   <0.01
   DictyBase                           325       323   <0.01
   GlycoSuiteDB                        282       282   <0.01
   SagaList                            260       259   <0.01
   PHCI-2DPAGE                         239       239   <0.01
   AGD                                 200       194   <0.01
   MypuList                            170       170   <0.01
   Aarhus/Ghent-2DPAGE                 128        98   <0.01
   Siena-2DPAGE                        103       103   <0.01
   HSC-2DPAGE                           85        85   <0.01
   COMPLUYEAST-2DPAGE                   59        59   <0.01
   PhosSite                             54        54   <0.01
   PMMA-2DPAGE                          52        52   <0.01
   Maize-2DPAGE                         39        39   <0.01
   Rat-heart-2DPAGE                     28        28   <0.01
   ANU-2DPAGE                           14        14   <0.01

Number of explicitly cross-referenced databases: 64
Number of implicitly cross-referenced databases: 32
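
The "Average per entry" column is the total number of lines divided by the
number of entries in the release. A minimal sketch for the four line-type
totals (names are illustrative):

```python
TOTAL_ENTRIES = 168297  # Swiss-Prot release 46.0

# Total number of lines per line type, from the table above.
line_totals = {
    "References (RL)": 331500,
    "Comments (CC)": 610556,
    "Features (FT)": 951134,
    "Cross-references (DR)": 1666608,
}

# Average number of lines per entry, formatted to two decimals.
averages = {
    name: f"{total / TOTAL_ENTRIES:.2f}" for name, total in line_totals.items()
}
for name, value in averages.items():
    print(f"{name:<24}{value:>6}")
```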


7.  MISCELLANEOUS STATISTICS

Total number of distinct authors cited in Swiss-Prot: 196818

Total number of entries encoded on a chloroplast: 3804
Total number of entries encoded on a mitochondrion: 2971
Total number of entries encoded on a cyanelle: 145
Total number of entries encoded on a plasmid: 2902

Number of fragments: 8457
Number of additional sequences encoded on splice variants: 10003


UniProt/TrEMBL protein database release 29.0 statistics


1.  INTRODUCTION

Release 29.0 of 01-Feb-2005 of UniProt/TrEMBL has been produced in
synchrony with UniProt/Swiss-Prot release 46 and with release 81 of the
EMBL/DDBJ/GenBank nucleotide sequence database, including updates up to
22-Jan-2005. It contains 1'589'670 sequence entries, comprising
497'792'130 amino acids.

153'776 sequences have been added since release 28, and the sequence and 
annotation data of 115'996 entries have been updated. This represents an 
increase of 11.24%.

The document delac_tr.txt lists all accession numbers that were previously
present in UniProt/TrEMBL but have now been deleted from the database. Most
deletions follow the deletion of the corresponding CDS in the source
nucleotide sequence databases EMBL-Bank/DDBJ/GenBank. In addition, some
entries are recognised to be open reading frames (ORFs) that were wrongly
predicted to code for proteins. When there is enough evidence that these
hypothetical proteins are not real, we remove them from UniProt/TrEMBL.


2.  AMINO ACID COMPOSITION

   2.1  Composition in percent for the complete database

   Ala (A) 7.78   Gln (Q) 3.87   Leu (L) 9.74   Ser (S) 7.04
   Arg (R) 5.32   Glu (E) 6.07   Lys (K) 5.54   Thr (T) 5.73
   Asn (N) 4.44   Gly (G) 6.93   Met (M) 2.41   Trp (W) 1.37
   Asp (D) 5.10   His (H) 2.27   Phe (F) 4.14   Tyr (Y) 3.14
   Cys (C) 1.50   Ile (I) 6.01   Pro (P) 4.93   Val (V) 6.50

   Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.07


   2.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Ile, Thr, Lys, Arg, Asp, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
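
The ordering in 2.2 can be reproduced from the percentages in 2.1. The
Python sketch below is one way to do it; the names COMPOSITION and
rank_by_frequency are hypothetical, and the values are copied from the
table in 2.1 (the ambiguity codes B, Z and X are left out).

```python
# Percent composition from section 2.1, keyed by three-letter code.
COMPOSITION = {
    "Ala": 7.78, "Gln": 3.87, "Leu": 9.74, "Ser": 7.04,
    "Arg": 5.32, "Glu": 6.07, "Lys": 5.54, "Thr": 5.73,
    "Asn": 4.44, "Gly": 6.93, "Met": 2.41, "Trp": 1.37,
    "Asp": 5.10, "His": 2.27, "Phe": 4.14, "Tyr": 3.14,
    "Cys": 1.50, "Ile": 6.01, "Pro": 4.93, "Val": 6.50,
}


def rank_by_frequency(composition: dict) -> list:
    """Return the amino acid codes sorted from most to least frequent."""
    return sorted(composition, key=composition.get, reverse=True)


RANKING = rank_by_frequency(COMPOSITION)

if __name__ == "__main__":
    print(", ".join(RANKING))
```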


3.  TAXONOMIC ORIGIN

   Total number of species represented in this release of 
   UniProt/TrEMBL: 84064

   The first twenty species represent 477233 sequences, or 30% of the
   total number of entries.
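
The figure for the first twenty species can be checked against table 3.2
and the total number of entries given in the introduction. The Python
sketch below does this; the variable names are hypothetical and the counts
are copied from rows 1 to 20 of table 3.2.

```python
# Frequencies of the twenty most represented species (table 3.2, rows 1-20).
TOP_20 = [
    121308, 50385, 48975, 38332, 38286, 24152, 21503, 19983, 15229, 13214,
    10987, 10842, 10664, 8177, 8088, 7810, 7578, 7429, 7194, 7097,
]

# Total number of UniProt/TrEMBL entries in this release (see Introduction).
TOTAL_ENTRIES = 1_589_670

top_20_sequences = sum(TOP_20)
top_20_share = 100 * top_20_sequences / TOTAL_ENTRIES

if __name__ == "__main__":
    print(f"{top_20_sequences} sequences, {top_20_share:.0f}% of all entries")
```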


   3.1 Table of the frequency of occurrence of species

        Species represented 1x:41727
                            2x:15907
                            3x: 8040
                            4x: 4247
                            5x: 2466
                            6x: 1872
                            7x: 1230
                            8x: 1067
                            9x:  853
                           10x:  642
                       11- 20x: 2798
                       21- 50x: 1662
                       51-100x:  684
                         >100x:  869
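
The rows of this table partition the species, so their counts should add
up to the total number of species quoted at the start of this section. The
Python sketch below checks this and derives the share of species
represented by a single sequence. The names are hypothetical; the counts
are copied from the table above.

```python
# Number of species by how many sequences represent them (table 3.1).
SPECIES_BY_OCCURRENCE = {
    "1": 41727, "2": 15907, "3": 8040, "4": 4247, "5": 2466,
    "6": 1872, "7": 1230, "8": 1067, "9": 853, "10": 642,
    "11-20": 2798, "21-50": 1662, "51-100": 684, ">100": 869,
}

# Total number of species represented in this release.
TOTAL_SPECIES = 84064

species_counted = sum(SPECIES_BY_OCCURRENCE.values())
singleton_share = 100 * SPECIES_BY_OCCURRENCE["1"] / TOTAL_SPECIES

if __name__ == "__main__":
    print(f"species counted: {species_counted}")
    print(f"represented once: {singleton_share:.1f}%")
```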


   3.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1     121308  Human immunodeficiency virus 1
       2      50385  Homo sapiens (Human)
       3      48975  Oryza sativa (japonica cultivar-group)
       4      38332  Arabidopsis thaliana (Mouse-ear cress)
       5      38286  Mus musculus (Mouse)
       6      24152  Drosophila melanogaster (Fruit fly)
       7      21503  Hepatitis C virus
       8      19983  Caenorhabditis elegans
       9      15229  Anopheles gambiae str. PEST
      10      13214  Caenorhabditis briggsae
      11      10987  Neurospora crassa
      12      10842  Brachydanio rerio (Zebrafish) (Danio rerio)
      13      10664  Xenopus laevis (African clawed frog)
      14       8177  Bradyrhizobium japonicum
      15       8088  Rattus norvegicus (Rat)
      16       7810  Plasmodium yoelii yoelii
      17       7578  Streptomyces coelicolor
      18       7429  Streptomyces avermitilis
      19       7194  Rhizobium loti (Mesorhizobium loti)
      20       7097  Rhodopirellula baltica
      21       7015  Agrobacterium tumefaciens (strain C58 / ATCC 33970)
      22       6822  Hepatitis B virus
      23       6494  Yarrowia lipolytica (Candida lipolytica)
      24       6397  Giardia lamblia ATCC 50803
      25       6369  Pseudomonas aeruginosa
      26       6318  Bacillus anthracis
      27       6265  Debaryomyces hansenii (Yeast) (Torulaspora hansenii)
      28       6084  Escherichia coli
      29       5951  uncultured bacterium
      30       5911  Nocardia farcinica
      31       5857  Burkholderia pseudomallei (Pseudomonas pseudomallei)
      32       5692  Rhizobium meliloti (Sinorhizobium meliloti)
      33       5672  Bacillus cereus (strain ATCC 10987)
      34       5573  Anabaena sp. (strain PCC 7120)
      35       5242  Photobacterium profundum (Photobacterium sp. (strain SS9))
      36       5231  Plasmodium falciparum (isolate 3D7)
      37       5229  Kluyveromyces lactis (Yeast)
      38       5137  Candida glabrata (Yeast) (Torulopsis glabrata)
      39       5096  Bacillus cereus (strain ZK)
      40       5095  Helicobacter pylori (Campylobacter pylori)
      41       5017  Bacillus thuringiensis (subsp. konkukian)
      42       4993  Pseudomonas syringae (pv. tomato)
      43       4941  Escherichia coli O157:H7
      44       4847  Bacillus cereus (strain ATCC 14579 / DSM 31)
      45       4846  Bordetella bronchiseptica (Alcaligenes bronchisepticus)
      46       4832  Gallus gallus (Chicken)
      47       4824  Bacteroides fragilis
      48       4800  Pseudomonas putida (strain KT2440)
      49       4753  Yersinia pestis
      50       4723  Ralstonia solanacearum (Pseudomonas solanacearum)
      51       4689  Rhodopseudomonas palustris
      52       4634  Bacteroides thetaiotaomicron
      53       4628  Pongo pygmaeus (Orangutan)
      54       4623  Leptospira interrogans
      55       4585  Vibrio vulnificus (strain YJ016)
      56       4526  Ashbya gossypii ATCC 10895
      57       4515  Burkholderia mallei (Pseudomonas mallei)
      58       4496  Azoarcus sp. (strain EbN1)
      59       4419  Erwinia carotovora (subsp. atroseptica) (Pectobacterium atrosepticum)
      60       4395  Vibrio parahaemolyticus
      61       4317  Mycobacterium tuberculosis
      62       4291  Mycobacterium paratuberculosis
      63       4233  Silicibacter pomeroyi DSS-3
      64       4198  Gloeobacter violaceus
      65       4188  Photorhabdus luminescens (subsp. laumondii)
      66       4168  Shewanella oneidensis
      67       4158  Haloarcula marismortui (Halobacterium marismortui)
      68       4130  Chromobacterium violaceum
      69       4124  Yersinia pseudotuberculosis
      70       4094  Bacillus licheniformis (strain DSM 13 / ATCC 14580)
      71       4072  Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150
      72       4069  Methanosarcina acetivorans
      73       4067  Bacillus clausii (strain KSM-K16)
      74       4060  Salmonella typhi
      75       4029  Vibrio vulnificus
      76       3973  Escherichia coli O6
      77       3941  Vibrio cholerae
      78       3920  Xanthomonas axonopodis (pv. citri)
      79       3894  Bordetella parapertussis
      80       3858  Plasmodium falciparum
      81       3843  Bacillus licheniformis
      82       3839  Corynebacterium glutamicum (Brevibacterium flavum)
      83       3777  Salmonella typhimurium
      84       3771  Oryza sativa (Rice)
      85       3768  Shigella flexneri
      86       3759  Listeria monocytogenes
      87       3716  Xanthomonas campestris (pv. campestris)
      88       3570  Enterococcus faecalis (Streptococcus faecalis)
      89       3567  Bacillus halodurans
      90       3552  Leptospira interrogans (serogroup Icterohaemorrhagiae / serovar Copenhageni)
      91       3535  Bdellovibrio bacteriovorus
      92       3511  Geobacillus kaustophilus HTA426
      93       3487  TT virus
      94       3441  Streptococcus pneumoniae
      95       3415  Clostridium acetobutylicum
      96       3393  Desulfovibrio vulgaris (strain Hildenborough / ATCC 29579 / NCIMB 8303)
      97       3325  Caulobacter crescentus
      98       3289  Geobacter sulfurreducens
      99       3283  Symbiobacterium thermophilum
     100       3269  Chimpanzee immunodeficiency virus (SIV(cpz)) (CIV)

   3.3  Distribution of the sequences by sections

   Division      sequences (% of the database)
   archaea           43134 ( 2.7%)
   fungi             62926 ( 4%)
   human             50385 ( 3.2%)
   invertebrates    184252 ( 11.6%)
   mammals           34073 ( 2.1%)
   plants           179409 ( 11.3%)
   bacteria         605632 ( 38.1%)
   rodents           55021 ( 3.5%)
   unclassified       1045 ( 0%)
   viruses          288453 ( 18%)
   vertebrates       85041 ( 5.3%)


4.  SEQUENCE SIZE

   4.1  Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50   18352             1001-1100     8773
                 51- 100   95681             1101-1200     6260
                101- 150  118102             1201-1300     4728
                151- 200  109209             1301-1400     3046
                201- 250  110494             1401-1500     2514
                251- 300  102539             1501-1600     1730
                301- 350   99602             1601-1700     1359
                351- 400   80912             1701-1800     1189
                401- 450   62563             1801-1900      944
                451- 500   54264             1901-2000      791
                501- 550   42499             2001-2100      607
                551- 600   29474             2101-2200      733
                601- 650   22620             2201-2300      612
                651- 700   17682             2301-2400      494
                701- 750   14980             2401-2500      322
                751- 800   12273             >2500         3046
                801- 850   10415
                851- 900    9233
                901- 950    6740
                951-1000    5475


   4.2  Longest and shortest sequences

   The shortest sequence is Q16047:     4 amino acids.
   The longest sequence is  Q8WZ42: 34350 amino acids.


5.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some UniProt/TrEMBL 
lines, as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                    2220491              1.40
   Journal                         1395141   1163227    0.88
   Submitted to EMBL/GenBank/DDBJ   816582    627835    0.51
   Thesis                             4582      4530   <0.01
   Book citation                      3718      3674   <0.01
   Submitted to other databases        452       444   <0.01
   Unpublished results                  10        10   <0.01
   Unpublished observations              4         4   <0.01
   Plant Gene Register                   1         1   <0.01
   Patent                                1         1   <0.01

Comments (CC)                       835627              0.53
   SIMILARITY                       222175    218793    0.14
   FUNCTION                         143297    142581    0.09
   CATALYTIC ACTIVITY               136440    123511    0.09
   SUBCELLULAR LOCATION             126593    126592    0.08
   SUBUNIT                           65266     65258    0.04
   CAUTION                           47416     47413    0.03
   PATHWAY                           42505     42266    0.03
   COFACTOR                          38630     38630    0.02
   INTERACTION                        5097      5097   <0.01
   MISCELLANEOUS                      4142      4125   <0.01
   DOMAIN                             3454      3262   <0.01
   ALLERGEN                            163       163   <0.01
   TISSUE SPECIFICITY                  138       138   <0.01
   MASS SPECTROMETRY                   121        65   <0.01
   DEVELOPMENTAL STAGE                  55        55   <0.01
   INDUCTION                            45        45   <0.01
   PTM                                  38        37   <0.01
   ALTERNATIVE PRODUCTS                 38        38   <0.01
   ENZYME REGULATION                     8         8   <0.01
   POLYMORPHISM                          3         3   <0.01
   DISEASE                               3         3   <0.01

Features (FT)                       951302              0.60
   NON_TER                          895245    527251    0.56
   CHAIN                             39563     23647    0.02
   SIGNAL                            12522     12311    0.01
   NON_CONS                            929       432   <0.01
   TRANSIT                             582       578   <0.01
   CARBOHYD                            580       100   <0.01
   DOMAIN                              520       168   <0.01
   SE_CYS                              318       168   <0.01
   TRANSMEM                            229        52   <0.01
   REPEAT                              169        23   <0.01
   CONFLICT                            164        27   <0.01
   DISULFID                             98        34   <0.01
   VARSPLIC                             77        31   <0.01
   VARIANT                              53        13   <0.01
   METAL                                43        17   <0.01
   ACT_SITE                             43        29   <0.01
   UNSURE                               33        14   <0.01
   DNA_BIND                             30        24   <0.01
   NP_BIND                              23        19   <0.01
   MOD_RES                              22        12   <0.01
   ZN_FING                              16         8   <0.01
   PROPEP                               15        12   <0.01
   SITE                                 10        10   <0.01
   CA_BIND                               4         3   <0.01
   PEPTIDE                               4         4   <0.01
   BINDING                               3         3   <0.01
   LIPID                                 3         2   <0.01
   MUTAGEN                               3         2   <0.01
   INIT_MET                              1         1   <0.01

Cross-references (DR)             11393181              7.17
   GO                              3490371   1018322    2.20
   InterPro                        2053199   1165127    1.29
   EMBL                            1851113   1583287    1.16
   Pfam                            1456963   1099139    0.92
   PROSITE                          748989    488427    0.47
   PRINTS                           316136    262369    0.20
   HSSP                             295204    294924    0.19
   SMART                            273636    211019    0.17
   PIR                              198843    163073    0.13
   ProDom                           190432    182879    0.12
   TIGRFAMs                         161550    149520    0.10
   TIGR                              83793     77785    0.05
   Ensembl                           75459     75444    0.05
   Gramene                           45809     45808    0.03
   MGD                               25480     25478    0.02
   FlyBase                           23005     22734    0.01
   WormPep                           19282     19203    0.01
   WormBase                          19270     19203    0.01
   PIRSF                              9497      9497    0.01
   MEROPS                             8679      8395    0.01
   ZFIN                               6174      6171   <0.01
   IntAct                             5438      5438   <0.01
   ListiList                          4826      4809   <0.01
   AGD                                4491      4491   <0.01
   PhotoList                          4309      4185   <0.01
   Genew                              3568      3568   <0.01
   PDB                                2945      1720   <0.01
   RGD                                2594      2579   <0.01
   TubercuList                        2497      2491   <0.01
   GeneDB_SPombe                      2236      2221   <0.01
   SagaList                           1834      1740   <0.01
   SGD                                1435      1434   <0.01
   TRANSFAC                           1042      1028   <0.01
   Leproma                             991       989   <0.01
   DictyBase                           980       980   <0.01
   MypuList                            612       608   <0.01
   REBASE                              126       121   <0.01
   PHCI-2DPAGE                         108       108   <0.01
   SWISS-2DPAGE                         98        98   <0.01
   ANU-2DPAGE                           74        74   <0.01
   Reactome                             34        34   <0.01
   OGP                                  29        29   <0.01
   PhosSite                             12        12   <0.01
   MIM                                  12        11   <0.01
   PMMA-2DPAGE                           3         3   <0.01
   Siena-2DPAGE                          2         2   <0.01
   COMPLUYEAST-2DPAGE                    1         1   <0.01
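
The "Average per entry" column appears to be the total number of lines
divided by the total number of entries in the release. The Python sketch
below reproduces a few values under that assumption. The function name
format_average is hypothetical, and the rule that averages rounding to
0.00 are shown as "<0.01" is inferred from the table, not documented.

```python
# Total number of UniProt/TrEMBL entries in this release (see Introduction).
TOTAL_ENTRIES = 1_589_670

# A few totals copied from the "Total number" column of the table above.
LINE_TOTALS = {
    "References (RL)": 2220491,
    "Comments (CC)": 835627,
    "Features (FT)": 951302,
    "Cross-references (DR)": 11393181,
    "PIRSF": 9497,
    "ZFIN": 6174,
}


def format_average(total_lines: int, entries: int = TOTAL_ENTRIES) -> str:
    """Format lines-per-entry to two decimals, as in the table.

    Assumption: values that would print as 0.00 are shown as "<0.01".
    """
    text = f"{total_lines / entries:.2f}"
    return "<0.01" if text == "0.00" else text


if __name__ == "__main__":
    for name, total in LINE_TOTALS.items():
        print(f"{name:<24}{total:>10}  {format_average(total):>6}")
```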


6.  MISCELLANEOUS STATISTICS

Total number of distinct authors cited in UniProt/TrEMBL: 205506

Total number of entries encoded on a chloroplast: 39087
Total number of entries encoded on a mitochondrion: 91928
Total number of entries encoded on a plasmid: 32361

Number of additional sequences encoded on splice variants: 57


Submissions and Updates

We welcome feedback from our users. We would especially appreciate being notified if you find that sequences belonging to your field of expertise are missing from the database. We would also like to be notified about annotations that need updating, for example if the function of a protein has been clarified or if new information about post-translational modifications has become available.

Submit new sequence data, updates and corrections at http://www.uniprot.org/support/submissions.shtml

For all queries regarding submissions to UniProt and to submit new protein sequence data, please contact:

UniProt Knowledgebase
The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: (+44 1223) 494 462
Telefax: (+44 1223) 494 468
E-mail:


Download information

Bi-Weekly releases

The latest data of the UniProt Knowledgebase is available in various formats (flatfile, XML or FASTA) at http://www.uniprot.org/database/download.shtml. The data is further supplemented by two files containing the sequences of all additional splice isoforms annotated in UniProt/Swiss-Prot and UniProt/TrEMBL. These data sets are documented in the file ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/README.varsplic
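
As an illustration of working with a FASTA download, the Python sketch
below reads records from any text stream in the generic FASTA layout (a
header line starting with ">" followed by one or more sequence lines). The
function name read_fasta and the example records are hypothetical; the
exact header layout of the distributed files is not described here and
should be checked against the files themselves.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example; not taken from the database.
EXAMPLE = """>EXAMPLE_1 hypothetical record
MKT
LLV
>EXAMPLE_2 another hypothetical record
ACDE
"""

RECORDS = list(read_fasta(io.StringIO(EXAMPLE)))

if __name__ == "__main__":
    for header, sequence in RECORDS:
        print(header, len(sequence))
```

For a downloaded file, the same function can be applied to the handle
returned by open() in text mode.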

Major releases

For users who wish to download the UniProt Knowledgebase only occasionally, we distribute the latest major release (updated four times per year) in flatfile format. Previous releases of UniProt/Swiss-Prot and UniProt/TrEMBL are archived under ftp://ftp.uniprot.org/databases/uniprot/previous_major_releases. The UniProt Knowledgebase major release is also available on CD-ROM from the EBI.


Contact

EMBL Outstation
European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: (+44 1223) 494 444
Fax: (+44 1223) 494 468
Electronic mail address:
WWW server: http://www.ebi.ac.uk/


Swiss Institute of Bioinformatics (SIB)
Centre Medical Universitaire
1, rue Michel Servet
1211 Geneva 4
Switzerland

Telephone: (+41 22) 702 50 50
Fax: (+41 22) 702 58 58
Electronic mail address:
WWW server: http://www.expasy.org/


Protein Information Resource (PIR)
Georgetown University Medical Center
3900 Reservoir Road, NW
Box 571455
Washington, DC 20057-1455
United States of America

Telephone: (+1 202) 687 1039
Fax: (+1 202) 687 0057
Electronic mail address:
WWW server: http://pir.georgetown.edu

Citation

If you want to cite UniProt in a publication, please use the following reference:

Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O'Donovan C., Redaschi N., Yeh L.S., The Universal Protein Resource (UniProt), Nucleic Acids Res. 33: D154-D159 (2005).