UniProt Knowledgebase release notes: UniProt release 4.0 of 1-Feb-2005
| Content |
|---|
Related documents: UniProt user manual, Recent changes, Forthcoming changes.
| Introduction |
|---|
Release 4.0 of the UniProt Knowledgebase is composed of the UniProt/Swiss-Prot Protein Knowledgebase release 46.0 and the UniProt/TrEMBL Protein Database release 29.0.
More information on these databases can be found in the user manual section "What is the UniProt Knowledgebase?".
| UniProt/Swiss-Prot protein knowledgebase release 46.0 statistics |
|---|
Release 46.0 of 01-Feb-2005 of UniProt/Swiss-Prot contains 168'297 sequence entries, comprising 61'443'278 amino acids abstracted from 124'910 references.
The growth of the database is summarized below.
| Release | Date | Number of entries | Number of amino acids |
|---|---|---|---|
| 2.0 | 09/86 | 3'939 | 900'163 |
| 3.0 | 11/86 | 4'160 | 969'641 |
| 4.0 | 04/87 | 4'387 | 1'036'010 |
| 5.0 | 09/87 | 5'205 | 1'327'683 |
| 6.0 | 01/88 | 6'102 | 1'653'982 |
| 7.0 | 04/88 | 6'821 | 1'885'771 |
| 8.0 | 08/88 | 7'724 | 2'224'465 |
| 9.0 | 11/88 | 8'702 | 2'498'140 |
| 10.0 | 03/89 | 10'008 | 2'952'613 |
| 11.0 | 07/89 | 10'856 | 3'265'966 |
| 12.0 | 10/89 | 12'305 | 3'797'482 |
| 13.0 | 01/90 | 13'837 | 4'347'336 |
| 14.0 | 04/90 | 15'409 | 4'914'264 |
| 15.0 | 08/90 | 16'941 | 5'486'399 |
| 16.0 | 11/90 | 18'364 | 5'986'949 |
| 17.0 | 02/91 | 20'024 | 6'524'504 |
| 18.0 | 05/91 | 20'772 | 6'792'034 |
| 19.0 | 08/91 | 21'795 | 7'173'785 |
| 20.0 | 11/91 | 22'654 | 7'500'130 |
| 21.0 | 03/92 | 23'742 | 7'866'596 |
| 22.0 | 05/92 | 25'044 | 8'375'696 |
| 23.0 | 08/92 | 26'706 | 9'011'391 |
| 24.0 | 12/92 | 28'154 | 9'545'427 |
| 25.0 | 04/93 | 29'955 | 10'214'020 |
| 26.0 | 07/93 | 31'808 | 10'875'091 |
| 27.0 | 10/93 | 33'329 | 11'484'420 |
| 28.0 | 02/94 | 36'000 | 12'496'420 |
| 29.0 | 06/94 | 38'303 | 13'464'008 |
| 30.0 | 10/94 | 40'292 | 14'147'368 |
| 31.0 | 02/95 | 43'470 | 15'335'248 |
| 32.0 | 11/95 | 49'340 | 17'385'503 |
| 33.0 | 02/96 | 52'205 | 18'531'384 |
| 34.0 | 10/96 | 59'021 | 21'210'389 |
| 35.0 | 11/97 | 69'113 | 25'083'768 |
| 36.0 | 07/98 | 74'019 | 26'840'295 |
| 37.0 | 12/98 | 77'977 | 28'268'293 |
| 38.0 | 07/99 | 80'000 | 29'085'965 |
| 39.0 | 05/00 | 86'593 | 31'411'114 |
| 40.0 | 10/01 | 101'602 | 37'315'215 |
| 41.0 | 02/03 | 122'564 | 44'986'459 |
| 42.0 | 10/03 | 135'850 | 50'046'799 |
| 43.0 | 03/04 | 146'720 | 54'093'154 |
| 44.0 | 07/04 | 153'871 | 56'608'159 |
| 45.0 | 10/04 | 163'235 | 59'631'787 |
| 46.0 | 02/05 | 168'297 | 61'443'278 |
In rare cases, Swiss-Prot entries are removed. Deleted entries are almost exclusively Open Reading Frames (ORFs) that were wrongly predicted to code for proteins. When there is enough evidence that these hypothetical proteins are not real, we remove them from Swiss-Prot. The document delac_sp.txt lists all accession numbers that were previously present in UniProt/Swiss-Prot but have since been deleted from the database.
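A list such as delac_sp.txt can be used to check whether an accession number cited elsewhere has since been deleted. The following is a minimal sketch only: the layout of the file is not described here, so the code assumes nothing beyond accession numbers appearing as separate tokens, and it assumes the six-character form seen in these notes (for example P83570 or Q8NF91). The function name and the sample tokens are hypothetical.

```python
import re

# Assumed shape of a six-character accession number (e.g. P83570, Q8NF91):
# one of O/P/Q, a digit, three letters or digits, and a final digit.
ACCESSION = re.compile(r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b")

def load_deleted_accessions(lines):
    """Collect every token that looks like an accession number."""
    deleted = set()
    for line in lines:
        deleted.update(ACCESSION.findall(line))
    return deleted

# Made-up placeholder lines, not real content of delac_sp.txt.
sample = ["Q0ZZZ0", "P0AAA1  O9XYZ9", "header text without accessions"]
deleted = load_deleted_accessions(sample)

print("Q0ZZZ0" in deleted)   # token listed in the sample
print("P83570" in deleted)   # token not listed in the sample
```

In practice the lines would come from the opened file rather than from a list in memory.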
We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:
From our efforts to annotate human sequence entries as completely as possible arose the HPI project, and the bacterial model organisms became the focus of the HAMAP project. Here is the current status of the model organisms which are not covered by these two projects:
| Organism | Database cross-references | Index file | Number of sequences |
|---|---|---|---|
| A.thaliana | None yet | arath.txt | 3'110 |
| C.albicans | None yet | calbican.txt | 321 |
| C.elegans | Wormpep | celegans.txt | 2'615 |
| D.discoideum | DictyBase | dicty.txt | 324 |
| D.melanogaster | FlyBase | fly.txt | 2'158 |
| M.musculus | MGD | mgdtosp.txt | 8'676 |
| S.cerevisiae | SGD | yeast.txt | 5'042 |
| S.pombe | GeneDB_SPombe | pombe.txt | 2'712 |
1. INTRODUCTION
Release 46.0 of 01-Feb-2005 of UniProt/Swiss-Prot contains 168297 sequence entries,
comprising 61443278 amino acids abstracted from 124910 references.
4537 sequences have been added since release 45, the sequence data of
866 existing entries has been updated and the annotations of
77494 entries have been revised. This represents an increase of 3%.
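The 3% figure can be reproduced from the entry counts in the growth table (163'235 entries in release 45.0 and 168'297 in release 46.0). This is a small sketch that assumes the percentage refers to the net change in the number of entries; the function name is illustrative.

```python
def percent_increase(previous: int, current: int) -> float:
    """Relative growth from `previous` to `current`, in percent."""
    return 100.0 * (current - previous) / previous

# Entry counts taken from the growth table for releases 45.0 and 46.0.
growth = percent_increase(163_235, 168_297)
print(f"{growth:.1f}%")  # 3.1%
```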
2. AMINO ACID COMPOSITION
2.1 Composition in percent for the complete database
Ala (A) 7.81 Gln (Q) 3.94 Leu (L) 9.62 Ser (S) 6.88
Arg (R) 5.32 Glu (E) 6.61 Lys (K) 5.93 Thr (T) 5.45
Asn (N) 4.20 Gly (G) 6.93 Met (M) 2.37 Trp (W) 1.15
Asp (D) 5.30 His (H) 2.28 Phe (F) 4.00 Tyr (Y) 3.07
Cys (C) 1.56 Ile (I) 5.91 Pro (P) 4.84 Val (V) 6.71
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.01
2.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Lys, Ile, Thr, Arg, Asp, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
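The classification in section 2.2 is the list of amino acids from section 2.1 sorted by decreasing percentage. The sketch below shows one way to compute such a composition from a set of sequences and to rank it; the function names and the toy sequences are assumptions made for illustration, while the percentages are those printed in section 2.1.

```python
from collections import Counter

def composition_percent(sequences):
    """Return {residue: percent of all residues} for an iterable of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: 100.0 * n / total for aa, n in counts.items()}

def rank_by_frequency(percentages):
    """Return the keys ordered from most to least frequent."""
    return sorted(percentages, key=percentages.get, reverse=True)

# Toy input, not real Swiss-Prot data: 4 x L, 2 x A, 1 x G.
toy = composition_percent(["LLLA", "LAG"])

# Percentages as printed in section 2.1 for the complete database.
swissprot = {
    "Ala": 7.81, "Gln": 3.94, "Leu": 9.62, "Ser": 6.88,
    "Arg": 5.32, "Glu": 6.61, "Lys": 5.93, "Thr": 5.45,
    "Asn": 4.20, "Gly": 6.93, "Met": 2.37, "Trp": 1.15,
    "Asp": 5.30, "His": 2.28, "Phe": 4.00, "Tyr": 3.07,
    "Cys": 1.56, "Ile": 5.91, "Pro": 4.84, "Val": 6.71,
}
ranking = rank_by_frequency(swissprot)
print(", ".join(ranking))
```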
3. TAXONOMIC ORIGIN
Total number of species represented in this release of Swiss-Prot: 8826
The first twenty species represent 62418 sequences: 37.1 % of the total
number of entries.
3.1 Table of the frequency of occurrence of species
Species represented 1x: 4171
2x: 1390
3x: 699
4x: 460
5x: 289
6x: 265
7x: 195
8x: 155
9x: 129
10x: 83
11- 20x: 371
21- 50x: 293
51-100x: 96
>100x: 230
3.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 11850 Homo sapiens (Human)
2 8676 Mus musculus (Mouse)
3 5042 Saccharomyces cerevisiae (Baker's yeast)
4 4838 Escherichia coli
5 4079 Rattus norvegicus (Rat)
6 3110 Arabidopsis thaliana (Mouse-ear cress)
7 2767 Bacillus subtilis
8 2712 Schizosaccharomyces pombe (Fission yeast)
9 2615 Caenorhabditis elegans
10 2158 Drosophila melanogaster (Fruit fly)
11 1782 Methanococcus jannaschii
12 1773 Haemophilus influenzae
13 1707 Escherichia coli O157:H7
14 1521 Bos taurus (Bovine)
15 1468 Salmonella typhimurium
16 1399 Mycobacterium tuberculosis
17 1368 Escherichia coli O6
18 1328 Shigella flexneri
19 1128 Gallus gallus (Chicken)
20 1097 Mycobacterium bovis
21 1051 Salmonella typhi
22 1012 Pseudomonas aeruginosa
23 958 Synechocystis sp. (strain PCC 6803)
24 955 Archaeoglobus fulgidus
25 923 Sus scrofa (Pig)
26 908 Xenopus laevis (African clawed frog)
27 807 Rhizobium meliloti (Sinorhizobium meliloti)
28 792 Vibrio cholerae
29 766 Yersinia pestis
30 747 Oryctolagus cuniculus (Rabbit)
31 745 Aquifex aeolicus
32 687 Mycoplasma pneumoniae
33 681 Pasteurella multocida
34 629 Vibrio parahaemolyticus
35 628 Streptomyces coelicolor
36 617 Bacillus halodurans
37 612 Mycobacterium leprae
38 606 Treponema pallidum
39 578 Vibrio vulnificus
40 573 Methanobacterium thermoautotrophicum
41 572 Buchnera aphidicola (subsp. Acyrthosiphon pisum)
42 568 Anabaena sp. (strain PCC 7120)
43 562 Helicobacter pylori (Campylobacter pylori)
44 561 Buchnera aphidicola (subsp. Schizaphis graminum)
45 549 Staphylococcus aureus (strain Mu50 / ATCC 700699)
46 547 Staphylococcus aureus (strain N315)
47 546 Rickettsia prowazekii
48 543 Helicobacter pylori J99 (Campylobacter pylori J99)
49 530 Staphylococcus aureus (strain MW2)
50 517 Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
51 514 Pseudomonas putida (strain KT2440)
52 513 Zea mays (Maize)
53 508 Pseudomonas syringae (pv. tomato)
54 507 Buchnera aphidicola (subsp. Baizongia pistaciae)
55 499 Staphylococcus epidermidis
56 499 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
57 499 Ralstonia solanacearum (Pseudomonas solanacearum)
58 496 Listeria monocytogenes
59 492 Listeria innocua
60 486 Mycoplasma genitalium
61 486 Rhizobium loti (Mesorhizobium loti)
62 482 Xanthomonas campestris (pv. campestris)
63 481 Neisseria meningitidis (serogroup B)
64 479 Neisseria meningitidis (serogroup A)
65 472 Clostridium acetobutylicum
66 467 Bradyrhizobium japonicum
67 464 Bacillus anthracis
68 463 Caulobacter crescentus
69 462 Canis familiaris (Dog)
70 461 Thermotoga maritima
71 444 Xanthomonas axonopodis (pv. citri)
72 442 Streptococcus pneumoniae
73 438 Oryza sativa (Rice)
74 438 Xylella fastidiosa
75 432 Deinococcus radiodurans
76 428 Pyrococcus horikoshii
77 428 Chlamydia trachomatis
78 426 Xylella fastidiosa (strain Temecula1 / ATCC 700964)
79 424 Pyrococcus abyssi
80 419 Shewanella oneidensis
81 417 Borrelia burgdorferi (Lyme disease spirochete)
82 411 Brucella melitensis
83 411 Brucella suis
84 410 Methanosarcina acetivorans
85 410 Chlamydia pneumoniae (Chlamydophila pneumoniae)
86 410 Clostridium perfringens
87 405 Vibrio vulnificus (strain YJ016)
88 403 Rhizobium sp. (strain NGR234)
89 400 Chlamydia muridarum
90 396 Corynebacterium glutamicum (Brevibacterium flavum)
91 395 Methanosarcina mazei (Methanosarcina frisia)
92 394 Halobacterium sp. (strain NRC-1 / ATCC 700922 / JCM 11081)
93 394 Bacillus cereus (strain ATCC 14579 / DSM 31)
94 393 Brachydanio rerio (Zebrafish) (Danio rerio)
95 384 Pyrococcus furiosus
96 380 Oceanobacillus iheyensis
97 378 Campylobacter jejuni
98 378 Sulfolobus solfataricus
99 377 Thermoanaerobacter tengcongensis
100 372 Photorhabdus luminescens (subsp. laumondii)
101 372 Neurospora crassa
102 371 Ovis aries (Sheep)
103 371 Lactobacillus plantarum
104 366 Nicotiana tabacum (Common tobacco)
105 365 Streptococcus pyogenes
106 360 Streptococcus pneumoniae (strain ATCC BAA-255 / R6)
107 359 Rickettsia conorii
108 348 Synechococcus elongatus (Thermosynechococcus elongatus)
109 344 Streptococcus mutans
110 335 Aeropyrum pernix
111 331 Chlorobium tepidum
112 324 Dictyostelium discoideum (Slime mold)
113 322 Streptococcus pyogenes (serotype M18)
114 321 Candida albicans (Yeast)
115 317 Streptococcus pyogenes (serotype M3)
116 314 Methanopyrus kandleri
117 313 Staphylococcus aureus
118 307 Enterococcus faecalis (Streptococcus faecalis)
119 304 Pan troglodytes (Chimpanzee)
120 303 Sulfolobus tokodaii
121 302 Pisum sativum (Garden pea)
122 293 Bordetella bronchiseptica (Alcaligenes bronchisepticus)
123 292 Bordetella pertussis
124 290 Thermoplasma acidophilum
125 288 Haemophilus ducreyi
126 283 Corynebacterium efficiens
127 283 Triticum aestivum (Wheat)
128 282 Bordetella parapertussis
129 279 Streptomyces avermitilis
130 278 Staphylococcus aureus (strain MRSA252)
131 277 Staphylococcus aureus (strain MSSA476)
132 276 Chromobacterium violaceum
133 273 Fusobacterium nucleatum (subsp. nucleatum)
134 272 Hordeum vulgare (Barley)
135 268 Bacteriophage T4
136 266 Nitrosomonas europaea
137 264 Glycine max (Soybean)
138 261 Lycopersicon esculentum (Tomato)
139 261 Streptococcus agalactiae (serotype V)
140 259 Streptococcus agalactiae (serotype III)
141 258 Leptospira interrogans
142 257 Cavia porcellus (Guinea pig)
143 256 Solanum tuberosum (Potato)
144 255 Thermoplasma volcanium
145 254 Rhodobacter capsulatus (Rhodopseudomonas capsulata)
146 254 Vaccinia virus (strain Copenhagen) (VACV)
147 254 Pyrobaculum aerophilum
148 248 Pseudomonas putida
149 240 Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
150 238 Spinacia oleracea (Spinach)
151 233 Bacillus stearothermophilus
152 221 Clostridium tetani
153 221 Wigglesworthia glossinidia brevipalpis
154 220 Porphyra purpurea
155 220 Chlamydophila caviae
156 218 Coxiella burnetii
157 218 Gloeobacter violaceus
158 216 Synechococcus sp. (strain WH8102)
159 212 Kluyveromyces lactis (Yeast)
160 212 Chlamydomonas reinhardtii
161 210 Prochlorococcus marinus
162 210 Bacteroides thetaiotaomicron
163 209 Macaca mulatta (Rhesus macaque)
164 208 Equus caballus (Horse)
165 207 Prochlorococcus marinus (strain MIT 9313)
166 206 Klebsiella pneumoniae
167 204 Macaca fascicularis (Crab eating macaque) (Cynomolgus monkey)
168 200 Vaccinia virus (strain Western Reserve / WR) (VACV)
3.3 Taxonomic distribution of the sequences
Kingdom sequences (% of the database)
Archaea 9025 ( 5%)
Bacteria 73807 ( 44%)
Eukaryota 76388 ( 45%)
Viruses 9077 ( 5%)
Within Eukaryota:
Category sequences (% of Eukaryota) (% of the complete database)
Human 11850 ( 16%) ( 7%)
Other Mammalia 21659 ( 28%) ( 13%)
Other Vertebrata 7019 ( 9%) ( 4%)
Viridiplantae 11826 ( 15%) ( 7%)
Fungi 11327 ( 15%) ( 7%)
Insecta 4177 ( 5%) ( 2%)
Nematoda 2880 ( 4%) ( 2%)
Other 5650 ( 7%) ( 3%)
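The two percentage columns of the "Within Eukaryota" table use different denominators: the number of eukaryotic entries (76388) and the total number of entries in the release (168297). The sketch below recomputes both columns from the counts above, assuming the percentages are rounded to whole numbers; the names used are illustrative.

```python
TOTAL_ENTRIES = 168_297   # all Swiss-Prot entries in release 46.0
EUKARYOTA = 76_388        # entries assigned to Eukaryota

# Counts copied from the "Within Eukaryota" table.
categories = {
    "Human": 11_850,
    "Other Mammalia": 21_659,
    "Other Vertebrata": 7_019,
    "Viridiplantae": 11_826,
    "Fungi": 11_327,
    "Insecta": 4_177,
    "Nematoda": 2_880,
    "Other": 5_650,
}

def share(part: int, whole: int) -> int:
    """Percentage of `whole` made up by `part`, rounded to a whole percent."""
    return round(100 * part / whole)

# The categories add up to the Eukaryota total.
category_sum = sum(categories.values())

for name, n in categories.items():
    print(f"{name:<17}{n:>6} ({share(n, EUKARYOTA):>3}%) ({share(n, TOTAL_ENTRIES):>3}%)")
```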
4. SEQUENCE SIZE
Distribution of the sequences by size (excluding fragments)
| From-To | Number | From-To | Number |
|---|---|---|---|
| 1-50 | 3303 | 1001-1100 | 1432 |
| 51-100 | 11821 | 1101-1200 | 1035 |
| 101-150 | 17104 | 1201-1300 | 739 |
| 151-200 | 15970 | 1301-1400 | 552 |
| 201-250 | 16646 | 1401-1500 | 438 |
| 251-300 | 14263 | 1501-1600 | 277 |
| 301-350 | 15036 | 1601-1700 | 209 |
| 351-400 | 13286 | 1701-1800 | 158 |
| 401-450 | 10277 | 1801-1900 | 173 |
| 451-500 | 8760 | 1901-2000 | 140 |
| 501-550 | 6626 | 2001-2100 | 84 |
| 551-600 | 4573 | 2101-2200 | 127 |
| 601-650 | 3841 | 2201-2300 | 115 |
| 651-700 | 2671 | 2301-2400 | 71 |
| 701-750 | 2259 | 2401-2500 | 63 |
| 751-800 | 1926 | >2500 | 445 |
| 801-850 | 1541 | | |
| 851-900 | 1697 | | |
| 901-950 | 1183 | | |
| 951-1000 | 999 | | |
The average sequence length in Swiss-Prot is 365 amino acids.
The shortest sequence is GWA_SEPOF (P83570): 2 amino acids.
The longest sequence is SYNE1_HUMAN (Q8NF91): 8797 amino acids.
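The average length quoted above is consistent with dividing the total number of amino acids by the total number of entries, and the size table uses bins of 50 residues up to 1000 and of 100 residues from 1001 to 2500. The sketch below illustrates both under those assumptions; `size_bin` is a hypothetical helper, not part of any UniProt tool.

```python
TOTAL_AMINO_ACIDS = 61_443_278  # from the release 46.0 introduction
TOTAL_ENTRIES = 168_297

average_length = TOTAL_AMINO_ACIDS / TOTAL_ENTRIES
print(round(average_length))  # 365

def size_bin(length: int) -> str:
    """Return the label of the size-table bin that contains `length`."""
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    if length > 2500:
        return ">2500"
    width = 50 if length <= 1000 else 100
    upper = -(-length // width) * width  # round up to a multiple of width
    return f"{upper - width + 1}-{upper}"

print(size_bin(2))     # shortest sequence quoted above -> 1-50
print(size_bin(8797))  # longest sequence quoted above -> >2500
```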
5. JOURNAL CITATIONS
Note: the following citation statistics reflect the number of distinct
journal citations.
Total number of journals cited in this release of Swiss-Prot: 1551
5.1 Table of the frequency of journal citations
Journals cited 1x: 567
2x: 212
3x: 102
4x: 68
5x: 62
6x: 34
7x: 33
8x: 30
9x: 20
10x: 17
11- 20x: 118
21- 50x: 123
51-100x: 55
>100x: 110
5.2 List of the most cited journals in Swiss-Prot
Nb Citations Journal name
-- --------- -------------------------------------------------------------
1 11442 Journal of Biological Chemistry
2 5878 Proceedings of the National Academy of Sciences of the U.S.A.
3 4050 Journal of Bacteriology
4 3813 Nucleic Acids Research
5 3789 Gene
6 3152 Biochemical and Biophysical Research Communications
7 3125 FEBS Letters
8 2802 Biochemistry
9 2751 European Journal of Biochemistry
10 2612 The EMBO Journal
11 2403 Nature
12 2358 Biochimica et Biophysica Acta
13 2134 Journal of Molecular Biology
14 2031 Genomics
15 1927 Molecular and Cellular Biology
16 1912 Cell
17 1542 Biochemical Journal
18 1422 Science
19 1268 Molecular Microbiology
20 1216 Plant Molecular Biology
21 1209 Molecular and General Genetics
22 980 Journal of Biochemistry
23 936 Journal of Cell Biology
24 914 Virology
25 910 Human Molecular Genetics
26 838 Nature Genetics
27 762 Genes and Development
28 751 Journal of Virology
29 722 The American Journal of Human Genetics
30 714 Oncogene
31 687 Plant Physiology
32 683 Human Mutation
33 631 Journal of Immunology
34 620 Infection and Immunity
35 612 Archives of Biochemistry and Biophysics
36 601 Yeast
37 587 Structure
38 553 Journal of General Virology
39 538 Development
40 529 Microbiology
41 505 FEMS Microbiology Letters
42 489 Genetics
43 480 Nature Structural Biology
44 442 Human Genetics
45 441 Blood
46 427 Current Genetics
47 386 Molecular and Biochemical Parasitology
48 375 Applied and Environmental Microbiology
49 361 Journal of Clinical Investigation
50 350 Developmental Biology
51 348 Mammalian Genome
52 346 Molecular Endocrinology
53 344 Protein Science
54 340 Cancer Research
55 338 Molecular Biology of the Cell
56 330 Immunogenetics
57 326 The Plant Cell
58 324 Acta Crystallographica, Section D
59 321 Mechanisms of Development
60 319 Neuron
61 314 The Journal of Experimental Medicine
62 312 Journal of Molecular Evolution
63 307 DNA and Cell Biology
64 306 Journal of Cell Science
65 282 Biological Chemistry Hoppe-Seyler
66 277 Journal of Neuroscience
67 277 The Plant Journal
68 276 Endocrinology
69 268 DNA Sequence
70 254 Journal of Neurochemistry
71 243 Molecular Cell
72 239 Journal of General Microbiology
73 237 Brain Research. Molecular Brain Research
74 236 Molecular Biology and Evolution
75 235 The Journal of Clinical Endocrinology and Metabolism
76 225 Toxicon
77 218 Current Biology
78 217 Bioscience, Biotechnology, and Biochemistry
79 214 Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
80 212 American Journal of Physiology
81 210 Cytogenetics and Cell Genetics
82 205 Comparative Biochemistry and Physiology
83 186 Molecular Pharmacology
84 180 Antimicrobial Agents and Chemotherapy
85 164 Proteins
86 159 Journal of Investigative Dermatology
87 158 DNA
88 156 Journal of Medical Genetics
89 154 DNA Research
90 151 Peptides
91 149 Tissue Antigens
92 146 Molecular Plant-Microbe Interactions
93 146 Genome Research
94 146 Virus Research
95 143 American Journal of Medical Genetics
96 141 Biochimie
97 138 Bioorganicheskaia Khimiia
98 135 Hemoglobin
99 130 European Journal of Immunology
100 129 Molecular and Cellular Endocrinology
101 126 Biology of Reproduction
102 123 Plant and Cell Physiology
103 116 Agricultural and Biological Chemistry
104 115 Insect Biochemistry and Molecular Biology
105 109 Archives of Microbiology
106 105 General and Comparative Endocrinology
107 105 Annals of Neurology
108 103 Diabetes
109 101 European Journal of Human Genetics
110 101 Molecular Phylogenetics and Evolution
6. STATISTICS FOR SOME LINE TYPES
The following table summarizes the total number of some Swiss-Prot lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.
Line type / subtype     Total number     Number of entries     Average per entry
--------------------------------- -------- --------- ---------
References (RL) 331500 1.97
Journal 295405 158585 1.76
Submitted to EMBL/GenBank/DDBJ 33350 28547 0.20
Submitted to Swiss-Prot 619 616 <0.01
Plant Gene Register 495 484 <0.01
Book citation 483 471 <0.01
Unpublished observations 444 440 <0.01
Thesis 280 278 <0.01
Submitted to other databases 217 214 <0.01
Patent 118 116 <0.01
Unpublished results 83 81 <0.01
Worm Breeder's Gazette 6 6 <0.01
Comments (CC) 610556 3.63
SIMILARITY 174573 149217 1.04
FUNCTION 110731 108250 0.66
SUBCELLULAR LOCATION 81853 81853 0.49
CATALYTIC ACTIVITY 59345 55679 0.35
SUBUNIT 53482 53481 0.32
PATHWAY 28898 27301 0.17
COFACTOR 20107 20107 0.12
TISSUE SPECIFICITY 18762 18762 0.11
PTM 11372 10101 0.07
MISCELLANEOUS 9674 8890 0.06
DOMAIN 6951 6128 0.04
ALTERNATIVE PRODUCTS 6544 6544 0.04
CAUTION 5775 5209 0.03
INDUCTION 4721 4721 0.03
DEVELOPMENTAL STAGE 4413 4413 0.03
DISEASE 2843 2087 0.02
INTERACTION 2606 2606 0.02
ENZYME REGULATION 2397 2397 0.01
MASS SPECTROMETRY 1600 1406 0.01
DATABASE 1481 1399 0.01
BIOPHYSICOCHEMICAL PROPERTIES 793 793 <0.01
POLYMORPHISM 496 484 <0.01
ALLERGEN 375 375 <0.01
RNA EDITING 340 340 <0.01
TOXIC DOSE 263 262 <0.01
BIOTECHNOLOGY 110 110 <0.01
PHARMACEUTICAL 51 51 <0.01
Features (FT) 951134 5.65
DOMAIN 137509 42734 0.82
TRANSMEM 106696 23186 0.63
CONFLICT 64076 22398 0.38
METAL 63755 15800 0.38
TURN 62445 4663 0.37
STRAND 57248 4166 0.34
CARBOHYD 56975 14081 0.34
DISULFID 52591 13918 0.31
HELIX 45087 4520 0.27
ACT_SITE 38281 22904 0.23
REPEAT 36216 5152 0.22
VARIANT 31599 6000 0.19
CHAIN 28442 23157 0.17
NP_BIND 23975 16553 0.14
MOD_RES 19066 10178 0.11
SIGNAL 18062 18060 0.11
SITE 15265 9051 0.09
BINDING 14746 9725 0.09
VARSPLIC 13053 5755 0.08
ZN_FING 10948 4044 0.07
NON_TER 10907 8300 0.06
MUTAGEN 9579 2606 0.06
INIT_MET 7510 7464 0.04
PROPEP 5846 4942 0.03
DNA_BIND 5179 4872 0.03
LIPID 5121 3374 0.03
PEPTIDE 3563 1599 0.02
TRANSIT 3059 3032 0.02
CA_BIND 2236 902 0.01
NON_CONS 1008 495 0.01
CROSSLNK 517 408 <0.01
UNSURE 383 156 <0.01
SE_CYS 191 134 <0.01
Cross-references (DR) 1666608 9.90
InterPro 341849 151755 2.03
EMBL 327282 160878 1.94
Pfam 196363 144251 1.17
PROSITE 150504 93796 0.89
PIR 91827 84791 0.55
GO 75177 21332 0.45
HSSP 69476 69476 0.41
PRINTS 60403 49140 0.36
TIGRFAMs 52285 48770 0.31
HAMAP 50708 50601 0.30
ProDom 45407 43563 0.27
SMART 41802 31654 0.25
PDB 24775 6745 0.15
Ensembl 22719 22718 0.13
TIGR 16617 16155 0.10
Genew 10935 10875 0.06
MIM 10379 8553 0.06
MGD 8327 8284 0.05
IntAct 7447 7447 0.04
SGD 5092 5031 0.03
PIRSF 5008 5001 0.03
GermOnline 4927 4877 0.03
EcoGene 4225 4223 0.03
EchoBASE 4159 4127 0.02
H-InvDB 3677 3659 0.02
MEROPS 3598 3507 0.02
WormPep 2990 2612 0.02
RGD 2886 2883 0.02
FlyBase 2747 2723 0.02
GeneDB_SPombe 2740 2710 0.02
TRANSFAC 2737 2455 0.02
SubtiList 2717 2716 0.02
WormBase 2672 2597 0.02
TubercuList 1427 1391 0.01
StyGene 1420 1417 0.01
SWISS-2DPAGE 1121 1121 0.01
ListiList 989 966 0.01
Reactome 717 717 <0.01
GeneFarm 625 624 <0.01
Leproma 616 612 <0.01
Gramene 569 564 <0.01
MaizeDB 419 414 <0.01
ZFIN 387 380 <0.01
PhotoList 372 372 <0.01
HIV 370 354 <0.01
REBASE 366 361 <0.01
OGP 364 364 <0.01
ECO2DBASE 351 299 <0.01
DictyBase 325 323 <0.01
GlycoSuiteDB 282 282 <0.01
SagaList 260 259 <0.01
PHCI-2DPAGE 239 239 <0.01
AGD 200 194 <0.01
MypuList 170 170 <0.01
Aarhus/Ghent-2DPAGE 128 98 <0.01
Siena-2DPAGE 103 103 <0.01
HSC-2DPAGE 85 85 <0.01
COMPLUYEAST-2DPAGE 59 59 <0.01
PhosSite 54 54 <0.01
PMMA-2DPAGE 52 52 <0.01
Maize-2DPAGE 39 39 <0.01
Rat-heart-2DPAGE 28 28 <0.01
ANU-2DPAGE 14 14 <0.01
Number of explicitly cross-referenced databases: 64
Number of implicitly cross-referenced databases: 32
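The "average per entry" column in the table above is consistent with dividing the total number of lines by the total number of entries in the release (168297), not by the number of entries that carry such a line. The following sketch reproduces the printed values under that assumption, including the "<0.01" notation for values that round to zero; the function name is illustrative.

```python
TOTAL_ENTRIES = 168_297  # Swiss-Prot entries in release 46.0

def average_per_entry(total_lines: int, total_entries: int = TOTAL_ENTRIES) -> str:
    """Format lines per database entry the way the table prints it."""
    value = f"{total_lines / total_entries:.2f}"
    # Values that round to 0.00 are shown as "<0.01".
    return "<0.01" if value == "0.00" else value

print(average_per_entry(331_500))    # References (RL)        -> 1.97
print(average_per_entry(1_666_608))  # Cross-references (DR)  -> 9.90
print(average_per_entry(619))        # Submitted to Swiss-Prot -> <0.01
```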
7. MISCELLANEOUS STATISTICS
Total number of distinct authors cited in Swiss-Prot: 196818
Total number of entries encoded on a chloroplast: 3804
Total number of entries encoded on a mitochondrion: 2971
Total number of entries encoded on a cyanelle: 145
Total number of entries encoded on a plasmid: 2902
Number of fragments: 8457
Number of additional sequences encoded on splice variants: 10003
| UniProt/TrEMBL protein database release 29.0 statistics |
|---|
1. INTRODUCTION
Release 29.0 of 01-Feb-2005 of UniProt/TrEMBL has been produced in sync
with UniProt/Swiss-Prot release 46 and with EMBL/DDBJ/GenBank nucleotide
sequence database release 81 and its updates up to 22-Jan-2005. It contains
1'589'670 sequence entries, comprising 497'792'130 amino acids.
153'776 sequences have been added since release 28, and the sequence and
annotation data of 115'996 entries have been updated. This represents an
increase of 11.24%.
The document delac_tr.txt lists all accession numbers that were previously
present in UniProt/TrEMBL but have since been deleted from the database.
Most deletions are due to the deletion of the corresponding CDS in the
source nucleotide sequence databases EMBL-Bank/DDBJ/GenBank. In addition,
some entries are recognised to be Open Reading Frames (ORFs) that have been
wrongly predicted to code for proteins. When there is enough evidence that
these hypothetical proteins are not real, we remove them from TrEMBL.
2. AMINO ACID COMPOSITION
2.1 Composition in percent for the complete database
Ala (A) 7.78 Gln (Q) 3.87 Leu (L) 9.74 Ser (S) 7.04
Arg (R) 5.32 Glu (E) 6.07 Lys (K) 5.54 Thr (T) 5.73
Asn (N) 4.44 Gly (G) 6.93 Met (M) 2.41 Trp (W) 1.37
Asp (D) 5.10 His (H) 2.27 Phe (F) 4.14 Tyr (Y) 3.14
Cys (C) 1.50 Ile (I) 6.01 Pro (P) 4.93 Val (V) 6.50
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.07
2.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Ile, Thr, Lys, Arg, Asp, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
3. TAXONOMIC ORIGIN
Total number of species represented in this release of
UniProt/TrEMBL: 84064
The first twenty species represent 477233 sequences: 30 % of the
total number of entries.
3.1 Table of the frequency of occurrence of species
Species represented 1x:41727
2x:15907
3x: 8040
4x: 4247
5x: 2466
6x: 1872
7x: 1230
8x: 1067
9x: 853
10x: 642
11- 20x: 2798
21- 50x: 1662
51-100x: 684
>100x: 869
3.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 121308 Human immunodeficiency virus 1
2 50385 Homo sapiens (Human)
3 48975 Oryza sativa (japonica cultivar-group)
4 38332 Arabidopsis thaliana (Mouse-ear cress)
5 38286 Mus musculus (Mouse)
6 24152 Drosophila melanogaster (Fruit fly)
7 21503 Hepatitis C virus
8 19983 Caenorhabditis elegans
9 15229 Anopheles gambiae str. PEST
10 13214 Caenorhabditis briggsae
11 10987 Neurospora crassa
12 10842 Brachydanio rerio (Zebrafish) (Danio rerio)
13 10664 Xenopus laevis (African clawed frog)
14 8177 Bradyrhizobium japonicum
15 8088 Rattus norvegicus (Rat)
16 7810 Plasmodium yoelii yoelii
17 7578 Streptomyces coelicolor
18 7429 Streptomyces avermitilis
19 7194 Rhizobium loti (Mesorhizobium loti)
20 7097 Rhodopirellula baltica
21 7015 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
22 6822 Hepatitis B virus
23 6494 Yarrowia lipolytica (Candida lipolytica)
24 6397 Giardia lamblia ATCC 50803
25 6369 Pseudomonas aeruginosa
26 6318 Bacillus anthracis
27 6265 Debaryomyces hansenii (Yeast) (Torulaspora hansenii)
28 6084 Escherichia coli
29 5951 uncultured bacterium
30 5911 Nocardia farcinica
31 5857 Burkholderia pseudomallei (Pseudomonas pseudomallei)
32 5692 Rhizobium meliloti (Sinorhizobium meliloti)
33 5672 Bacillus cereus (strain ATCC 10987)
34 5573 Anabaena sp. (strain PCC 7120)
35 5242 Photobacterium profundum (Photobacterium sp. (strain SS9))
36 5231 Plasmodium falciparum (isolate 3D7)
37 5229 Kluyveromyces lactis (Yeast)
38 5137 Candida glabrata (Yeast) (Torulopsis glabrata)
39 5096 Bacillus cereus (strain ZK)
40 5095 Helicobacter pylori (Campylobacter pylori)
41 5017 Bacillus thuringiensis (subsp. konkukian)
42 4993 Pseudomonas syringae (pv. tomato)
43 4941 Escherichia coli O157:H7
44 4847 Bacillus cereus (strain ATCC 14579 / DSM 31)
45 4846 Bordetella bronchiseptica (Alcaligenes bronchisepticus)
46 4832 Gallus gallus (Chicken)
47 4824 Bacteroides fragilis
48 4800 Pseudomonas putida (strain KT2440)
49 4753 Yersinia pestis
50 4723 Ralstonia solanacearum (Pseudomonas solanacearum)
51 4689 Rhodopseudomonas palustris
52 4634 Bacteroides thetaiotaomicron
53 4628 Pongo pygmaeus (Orangutan)
54 4623 Leptospira interrogans
55 4585 Vibrio vulnificus (strain YJ016)
56 4526 Ashbya gossypii ATCC 10895
57 4515 Burkholderia mallei (Pseudomonas mallei)
58 4496 Azoarcus sp. (strain EbN1)
59 4419 Erwinia carotovora (subsp. atroseptica) (Pectobacterium atrosepticum)
60 4395 Vibrio parahaemolyticus
61 4317 Mycobacterium tuberculosis
62 4291 Mycobacterium paratuberculosis
63 4233 Silicibacter pomeroyi DSS-3
64 4198 Gloeobacter violaceus
65 4188 Photorhabdus luminescens (subsp. laumondii)
66 4168 Shewanella oneidensis
67 4158 Haloarcula marismortui (Halobacterium marismortui)
68 4130 Chromobacterium violaceum
69 4124 Yersinia pseudotuberculosis
70 4094 Bacillus licheniformis (strain DSM 13 / ATCC 14580)
71 4072 Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150
72 4069 Methanosarcina acetivorans
73 4067 Bacillus clausii (strain KSM-K16)
74 4060 Salmonella typhi
75 4029 Vibrio vulnificus
76 3973 Escherichia coli O6
77 3941 Vibrio cholerae
78 3920 Xanthomonas axonopodis (pv. citri)
79 3894 Bordetella parapertussis
80 3858 Plasmodium falciparum
81 3843 Bacillus licheniformis
82 3839 Corynebacterium glutamicum (Brevibacterium flavum)
83 3777 Salmonella typhimurium
84 3771 Oryza sativa (Rice)
85 3768 Shigella flexneri
86 3759 Listeria monocytogenes
87 3716 Xanthomonas campestris (pv. campestris)
88 3570 Enterococcus faecalis (Streptococcus faecalis)
89 3567 Bacillus halodurans
90 3552 Leptospira interrogans (serogroup Icterohaemorrhagiae / serovar Copenhageni)
91 3535 Bdellovibrio bacteriovorus
92 3511 Geobacillus kaustophilus HTA426
93 3487 TT virus
94 3441 Streptococcus pneumoniae
95 3415 Clostridium acetobutylicum
96 3393 Desulfovibrio vulgaris (strain Hildenborough / ATCC 29579 / NCIMB 8303)
97 3325 Caulobacter crescentus
98 3289 Geobacter sulfurreducens
99 3283 Symbiobacterium thermophilum
100 3269 Chimpanzee immunodeficiency virus (SIV(cpz)) (CIV)
3.3 Distribution of the sequences by sections
Division sequences (% of the database)
archaea 43134 ( 2.7%)
fungi 62926 ( 4%)
human 50385 ( 3.2%)
invertebrates 184252 ( 11.6%)
mammals 34073 ( 2.1%)
plants 179409 ( 11.3%)
bacteria 605632 ( 38.1%)
rodents 55021 ( 3.5%)
unclassified 1045 ( 0%)
viruses 288453 ( 18%)
vertebrates 85041 ( 5.3%)
4. SEQUENCE SIZE
4.1 Distribution of the sequences by size (excluding fragments)
| From-To | Number | From-To | Number |
|---|---|---|---|
| 1-50 | 18352 | 1001-1100 | 8773 |
| 51-100 | 95681 | 1101-1200 | 6260 |
| 101-150 | 118102 | 1201-1300 | 4728 |
| 151-200 | 109209 | 1301-1400 | 3046 |
| 201-250 | 110494 | 1401-1500 | 2514 |
| 251-300 | 102539 | 1501-1600 | 1730 |
| 301-350 | 99602 | 1601-1700 | 1359 |
| 351-400 | 80912 | 1701-1800 | 1189 |
| 401-450 | 62563 | 1801-1900 | 944 |
| 451-500 | 54264 | 1901-2000 | 791 |
| 501-550 | 42499 | 2001-2100 | 607 |
| 551-600 | 29474 | 2101-2200 | 733 |
| 601-650 | 22620 | 2201-2300 | 612 |
| 651-700 | 17682 | 2301-2400 | 494 |
| 701-750 | 14980 | 2401-2500 | 322 |
| 751-800 | 12273 | >2500 | 3046 |
| 801-850 | 10415 | | |
| 851-900 | 9233 | | |
| 901-950 | 6740 | | |
| 951-1000 | 5475 | | |
4.2 Longest and shortest sequences
The shortest sequence is Q16047: 4 amino acids.
The longest sequence is Q8WZ42: 34350 amino acids.
5. STATISTICS FOR SOME LINE TYPES
The following table summarizes the total number of some UniProt/TrEMBL
lines, as well as the number of entries with at least one such line, and the
frequency of the lines.
Line type / subtype     Total number     Number of entries     Average per entry
--------------------------------- -------- --------- ---------
References (RL) 2220491 1.40
Journal 1395141 1163227 0.88
Submitted to EMBL/GenBank/DDBJ 816582 627835 0.51
Thesis 4582 4530 <0.01
Book citation 3718 3674 <0.01
Submitted to other databases 452 444 <0.01
Unpublished results 10 10 <0.01
Unpublished observations 4 4 <0.01
Plant Gene Register 1 1 <0.01
Patent 1 1 <0.01
Comments (CC) 835627 0.53
SIMILARITY 222175 218793 0.14
FUNCTION 143297 142581 0.09
CATALYTIC ACTIVITY 136440 123511 0.09
SUBCELLULAR LOCATION 126593 126592 0.08
SUBUNIT 65266 65258 0.04
CAUTION 47416 47413 0.03
PATHWAY 42505 42266 0.03
COFACTOR 38630 38630 0.02
INTERACTION 5097 5097 <0.01
MISCELLANEOUS 4142 4125 <0.01
DOMAIN 3454 3262 <0.01
ALLERGEN 163 163 <0.01
TISSUE SPECIFICITY 138 138 <0.01
MASS SPECTROMETRY 121 65 <0.01
DEVELOPMENTAL STAGE 55 55 <0.01
INDUCTION 45 45 <0.01
PTM 38 37 <0.01
ALTERNATIVE PRODUCTS 38 38 <0.01
ENZYME REGULATION 8 8 <0.01
POLYMORPHISM 3 3 <0.01
DISEASE 3 3 <0.01
Features (FT) 951302 0.60
NON_TER 895245 527251 0.56
CHAIN 39563 23647 0.02
SIGNAL 12522 12311 0.01
NON_CONS 929 432 <0.01
TRANSIT 582 578 <0.01
CARBOHYD 580 100 <0.01
DOMAIN 520 168 <0.01
SE_CYS 318 168 <0.01
TRANSMEM 229 52 <0.01
REPEAT 169 23 <0.01
CONFLICT 164 27 <0.01
DISULFID 98 34 <0.01
VARSPLIC 77 31 <0.01
VARIANT 53 13 <0.01
METAL 43 17 <0.01
ACT_SITE 43 29 <0.01
UNSURE 33 14 <0.01
DNA_BIND 30 24 <0.01
NP_BIND 23 19 <0.01
MOD_RES 22 12 <0.01
ZN_FING 16 8 <0.01
PROPEP 15 12 <0.01
SITE 10 10 <0.01
CA_BIND 4 3 <0.01
PEPTIDE 4 4 <0.01
BINDING 3 3 <0.01
LIPID 3 2 <0.01
MUTAGEN 3 2 <0.01
INIT_MET 1 1 <0.01
Cross-references (DR) 11393181 7.17
GO 3490371 1018322 2.20
InterPro 2053199 1165127 1.29
EMBL 1851113 1583287 1.16
Pfam 1456963 1099139 0.92
PROSITE 748989 488427 0.47
PRINTS 316136 262369 0.20
HSSP 295204 294924 0.19
SMART 273636 211019 0.17
PIR 198843 163073 0.13
ProDom 190432 182879 0.12
TIGRFAMs 161550 149520 0.10
TIGR 83793 77785 0.05
Ensembl 75459 75444 0.05
Gramene 45809 45808 0.03
MGD 25480 25478 0.02
FlyBase 23005 22734 0.01
WormPep 19282 19203 0.01
WormBase 19270 19203 0.01
PIRSF 9497 9497 0.01
MEROPS 8679 8395 0.01
ZFIN 6174 6171 <0.01
IntAct 5438 5438 <0.01
ListiList 4826 4809 <0.01
AGD 4491 4491 <0.01
PhotoList 4309 4185 <0.01
Genew 3568 3568 <0.01
PDB 2945 1720 <0.01
RGD 2594 2579 <0.01
TubercuList 2497 2491 <0.01
GeneDB_SPombe 2236 2221 <0.01
SagaList 1834 1740 <0.01
SGD 1435 1434 <0.01
TRANSFAC 1042 1028 <0.01
Leproma 991 989 <0.01
DictyBase 980 980 <0.01
MypuList 612 608 <0.01
REBASE 126 121 <0.01
PHCI-2DPAGE 108 108 <0.01
SWISS-2DPAGE 98 98 <0.01
ANU-2DPAGE 74 74 <0.01
Reactome 34 34 <0.01
OGP 29 29 <0.01
PhosSite 12 12 <0.01
MIM 12 11 <0.01
PMMA-2DPAGE 3 3 <0.01
Siena-2DPAGE 2 2 <0.01
COMPLUYEAST-2DPAGE 1 1 <0.01
6. MISCELLANEOUS STATISTICS
Total number of distinct authors cited in UniProt/TrEMBL: 205506
Total number of entries encoded on a chloroplast: 39087
Total number of entries encoded on a mitochondrion: 91928
Total number of entries encoded on a plasmid: 32361
Number of additional sequences encoded on splice variants: 57
| Submissions and Updates |
|---|
We welcome feedback from our users. We would especially appreciate your notifying us if you find that sequences belonging to your field of expertise are missing from the database. We also would like to be notified about annotations to be updated, if, for example, the function of a protein has been clarified or if new information about post-translational modifications has become available.
Submit new sequence data, updates and corrections at http://www.uniprot.org/support/submissions.shtml
For all queries regarding submissions to UniProt and to submit new protein sequence data, please contact:
UniProt Knowledgebase
The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Telephone: (+44 1223) 494 462
Telefax: (+44 1223) 494 468
E-mail:
| Download information |
|---|
The latest data of the UniProt Knowledgebase is available in various formats (flatfile, XML or FASTA) at http://www.uniprot.org/database/download.shtml. The data is further supplemented by two files containing the sequences of all additional splice isoforms annotated in UniProt/Swiss-Prot and UniProt/TrEMBL. These data sets are documented in the file ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/README.varsplic
For users who wish to download the UniProt Knowledgebase only occasionally, we distribute the latest major release (updated 4 times per year) in flatfile format. Previous UniProt/Swiss-Prot and UniProt/TrEMBL releases are archived under ftp://ftp.uniprot.org/databases/uniprot/previous_major_releases. The UniProt Knowledgebase major release is also available on CD-ROM from the EBI.
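For readers who download the FASTA version, the sketch below shows one simple way to read such a file. It assumes only the general FASTA convention (a header line starting with ">" followed by one or more sequence lines) and makes no assumption about the contents of the UniProt header line; the function name and the example records are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)

# Made-up records for illustration only.
example = [">entry1 hypothetical example", "MKT", "AYI", "", ">entry2", "GG"]
records = list(read_fasta(example))

for header, sequence in records:
    print(header, len(sequence))
```

Because `read_fasta` accepts any iterable of lines, an open file object can be passed to it directly.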
| Citation |
|---|
If you want to cite UniProt in a publication please use the following reference:
Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O'Donovan C., Redaschi N., Yeh L.S., The Universal Protein Resource (UniProt), Nucleic Acids Res. 33: D154-D159 (2005).