UniProt Knowledgebase Release notes: UniProt release 3.0 of 25-Oct-2004
| Content |
|---|
Related documents: UniProt user manual, Recent changes, Forthcoming changes.
| Introduction |
|---|
Release 3.0 of the UniProt Knowledgebase is composed of the UniProt/Swiss-Prot Protein Knowledgebase release 45.0 and the UniProt/TrEMBL Protein Database release 28.0.
More information on these databases can be found in the user manual, under "What is the UniProt Knowledgebase?".
| UniProt/Swiss-Prot protein Knowledgebase release 45.0 statistics |
|---|
Release 45.0 of 25-Oct-2004 of Swiss-Prot contains 163'235 sequence entries, comprising 59'631'787 amino acids abstracted from 120'520 references.
The growth of the database is summarized below.
| Release | Date | Number of entries | Number of amino acids |
|---|---|---|---|
| 2.0 | 09/86 | 3'939 | 900'163 |
| 3.0 | 11/86 | 4'160 | 969'641 |
| 4.0 | 04/87 | 4'387 | 1'036'010 |
| 5.0 | 09/87 | 5'205 | 1'327'683 |
| 6.0 | 01/88 | 6'102 | 1'653'982 |
| 7.0 | 04/88 | 6'821 | 1'885'771 |
| 8.0 | 08/88 | 7'724 | 2'224'465 |
| 9.0 | 11/88 | 8'702 | 2'498'140 |
| 10.0 | 03/89 | 10'008 | 2'952'613 |
| 11.0 | 07/89 | 10'856 | 3'265'966 |
| 12.0 | 10/89 | 12'305 | 3'797'482 |
| 13.0 | 01/90 | 13'837 | 4'347'336 |
| 14.0 | 04/90 | 15'409 | 4'914'264 |
| 15.0 | 08/90 | 16'941 | 5'486'399 |
| 16.0 | 11/90 | 18'364 | 5'986'949 |
| 17.0 | 02/91 | 20'024 | 6'524'504 |
| 18.0 | 05/91 | 20'772 | 6'792'034 |
| 19.0 | 08/91 | 21'795 | 7'173'785 |
| 20.0 | 11/91 | 22'654 | 7'500'130 |
| 21.0 | 03/92 | 23'742 | 7'866'596 |
| 22.0 | 05/92 | 25'044 | 8'375'696 |
| 23.0 | 08/92 | 26'706 | 9'011'391 |
| 24.0 | 12/92 | 28'154 | 9'545'427 |
| 25.0 | 04/93 | 29'955 | 10'214'020 |
| 26.0 | 07/93 | 31'808 | 10'875'091 |
| 27.0 | 10/93 | 33'329 | 11'484'420 |
| 28.0 | 02/94 | 36'000 | 12'496'420 |
| 29.0 | 06/94 | 38'303 | 13'464'008 |
| 30.0 | 10/94 | 40'292 | 14'147'368 |
| 31.0 | 02/95 | 43'470 | 15'335'248 |
| 32.0 | 11/95 | 49'340 | 17'385'503 |
| 33.0 | 02/96 | 52'205 | 18'531'384 |
| 34.0 | 10/96 | 59'021 | 21'210'389 |
| 35.0 | 11/97 | 69'113 | 25'083'768 |
| 36.0 | 07/98 | 74'019 | 26'840'295 |
| 37.0 | 12/98 | 77'977 | 28'268'293 |
| 38.0 | 07/99 | 80'000 | 29'085'965 |
| 39.0 | 05/00 | 86'593 | 31'411'114 |
| 40.0 | 10/01 | 101'602 | 37'315'215 |
| 41.0 | 02/03 | 122'564 | 44'986'459 |
| 42.0 | 10/03 | 135'850 | 50'046'799 |
| 43.0 | 03/04 | 146'720 | 54'093'154 |
| 44.0 | 07/04 | 153'871 | 56'608'159 |
| 45.0 | 10/04 | 163'235 | 59'631'787 |
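The counts in the table above use an apostrophe as thousands separator. As a minimal sketch (the function names `parse_count` and `parse_growth_row` are illustrative and not part of any UniProt tool), a row can be turned into plain integers and used to compare two releases:

```python
def parse_count(text: str) -> int:
    """Convert a count such as "163'235" to an integer."""
    return int(text.replace("'", "").strip())


def parse_growth_row(row: str) -> tuple[str, str, int, int]:
    """Split one '| release | date | entries | amino acids |' row."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    release, date, entries, amino_acids = cells
    return release, date, parse_count(entries), parse_count(amino_acids)


# Two rows copied from the growth table.
first = parse_growth_row("| 2.0 | 09/86 | 3'939 | 900'163 |")
latest = parse_growth_row("| 45.0 | 10/04 | 163'235 | 59'631'787 |")

# Growth factor in number of entries between release 2.0 and release 45.0.
fold_growth = latest[2] / first[2]
print(f"{fold_growth:.1f}x more entries than in release 2.0")
```

The same helper should work for any other row of the table, since all rows share the four-column layout.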
In rare cases, Swiss-Prot entries are removed. Deleted entries are almost exclusively Open Reading Frames (ORFs) that have been wrongly predicted to code for proteins. When there is enough evidence that these hypothetical proteins are not real, we remove them from Swiss-Prot. The document delac_sp.txt lists all accession numbers that were previously present in UniProt/Swiss-Prot but have since been deleted from the database.
We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:
From our efforts to annotate human sequence entries as completely as possible arose the HPI project, and the bacterial model organisms became the focus of the HAMAP project. Here is the current status of the model organisms which are not covered by these two projects:
| Organism | Database cross-references | Index file | Number of sequences |
|---|---|---|---|
| A.thaliana | None yet | arath.txt | 2'981 |
| C.albicans | None yet | calbican.txt | 305 |
| C.elegans | Wormpep | celegans.txt | 2'543 |
| D.discoideum | DictyBase | dicty.txt | 323 |
| D.melanogaster | FlyBase | fly.txt | 2'118 |
| M.musculus | MGD | mgdtosp.txt | 8'368 |
| S.cerevisiae | SGD | yeast.txt | 4'992 |
| S.pombe | GeneDB_SPombe | pombe.txt | 2'672 |
1. INTRODUCTION
Release 45.0 of 25-Oct-2004 of UniProt/Swiss-Prot contains 163235 sequence
entries, comprising 59631787 amino acids abstracted from 120520 references.
6183 sequences have been added since release 44, the sequence data of
2851 existing entries has been updated and the annotations of
71220 entries have been revised. This represents an increase of 4%.
2. AMINO ACID COMPOSITION
2.1 Composition in percent for the complete database
Ala (A) 7.82 Gln (Q) 3.94 Leu (L) 9.62 Ser (S) 6.87
Arg (R) 5.32 Glu (E) 6.60 Lys (K) 5.93 Thr (T) 5.46
Asn (N) 4.20 Gly (G) 6.94 Met (M) 2.37 Trp (W) 1.16
Asp (D) 5.30 His (H) 2.27 Phe (F) 4.01 Tyr (Y) 3.07
Cys (C) 1.56 Ile (I) 5.90 Pro (P) 4.85 Val (V) 6.71
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.01
2.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Lys, Ile, Thr, Arg, Asp, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
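The classification in 2.2 can be reproduced from the percentages in 2.1. The following is only a sketch, assuming the ranking is a plain descending sort of those percentages with the ambiguity codes B, Z and X left out; the variable names are illustrative.

```python
# Percentages copied from section 2.1 (Asx, Glx and Xaa are left out).
composition = {
    "Ala": 7.82, "Gln": 3.94, "Leu": 9.62, "Ser": 6.87,
    "Arg": 5.32, "Glu": 6.60, "Lys": 5.93, "Thr": 5.46,
    "Asn": 4.20, "Gly": 6.94, "Met": 2.37, "Trp": 1.16,
    "Asp": 5.30, "His": 2.27, "Phe": 4.01, "Tyr": 3.07,
    "Cys": 1.56, "Ile": 5.90, "Pro": 4.85, "Val": 6.71,
}

# Sort residue names from most to least frequent.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```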
3. TAXONOMIC ORIGIN
Total number of species represented in this release of
UniProt/Swiss-Prot: 8703
The first twenty species represent 61239 sequences: 37.5 % of the total
number of entries.
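The share quoted above can be checked against the first twenty rows of table 3.2 below. This is a small sketch with illustrative variable names, taking the total number of entries from the introduction.

```python
TOTAL_ENTRIES = 163235  # total number of entries, from the introduction

# Frequencies of the twenty most represented species (table 3.2).
top_twenty = [
    11539, 8368, 4992, 4838, 3976, 2981, 2750, 2672, 2543, 2118,
    1782, 1773, 1690, 1506, 1454, 1399, 1344, 1307, 1114, 1093,
]

top_sum = sum(top_twenty)
top_share = 100 * top_sum / TOTAL_ENTRIES
print(f"{top_sum} sequences: {top_share:.1f} % of the total")
```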
3.1 Table of the frequency of occurrence of species
Species represented 1x: 4130
2x: 1366
3x: 690
4x: 455
5x: 282
6x: 261
7x: 196
8x: 151
9x: 132
10x: 84
11- 20x: 364
21- 50x: 276
51-100x: 97
>100x: 219
3.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 11539 Homo sapiens (Human)
2 8368 Mus musculus (Mouse)
3 4992 Saccharomyces cerevisiae (Baker's yeast)
4 4838 Escherichia coli
5 3976 Rattus norvegicus (Rat)
6 2981 Arabidopsis thaliana (Mouse-ear cress)
7 2750 Bacillus subtilis
8 2672 Schizosaccharomyces pombe (Fission yeast)
9 2543 Caenorhabditis elegans
10 2118 Drosophila melanogaster (Fruit fly)
11 1782 Methanococcus jannaschii
12 1773 Haemophilus influenzae
13 1690 Escherichia coli O157:H7
14 1506 Bos taurus (Bovine)
15 1454 Salmonella typhimurium
16 1399 Mycobacterium tuberculosis
17 1344 Escherichia coli O6
18 1307 Shigella flexneri
19 1114 Gallus gallus (Chicken)
20 1093 Mycobacterium bovis
21 1036 Salmonella typhi
22 1004 Pseudomonas aeruginosa
23 957 Synechocystis sp. (strain PCC 6803)
24 951 Archaeoglobus fulgidus
25 904 Sus scrofa (Pig)
26 900 Xenopus laevis (African clawed frog)
27 803 Rhizobium meliloti (Sinorhizobium meliloti)
28 784 Vibrio cholerae
29 753 Yersinia pestis
30 744 Aquifex aeolicus
31 742 Oryctolagus cuniculus (Rabbit)
32 687 Mycoplasma pneumoniae
33 676 Pasteurella multocida
34 619 Streptomyces coelicolor
35 618 Vibrio parahaemolyticus
36 609 Mycobacterium leprae
37 608 Bacillus halodurans
38 606 Treponema pallidum
39 572 Buchnera aphidicola (subsp. Acyrthosiphon pisum)
40 571 Methanobacterium thermoautotrophicum
41 571 Vibrio vulnificus
42 566 Anabaena sp. (strain PCC 7120)
43 561 Buchnera aphidicola (subsp. Schizaphis graminum)
44 560 Helicobacter pylori (Campylobacter pylori)
45 546 Rickettsia prowazekii
46 541 Helicobacter pylori J99 (Campylobacter pylori J99)
47 536 Staphylococcus aureus (strain Mu50 / ATCC 700699)
48 534 Staphylococcus aureus (strain N315)
49 517 Staphylococcus aureus (strain MW2)
50 511 Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
51 508 Zea mays (Maize)
52 507 Pseudomonas putida (strain KT2440)
53 507 Buchnera aphidicola (subsp. Baizongia pistaciae)
54 500 Pseudomonas syringae (pv. tomato)
55 496 Ralstonia solanacearum (Pseudomonas solanacearum)
56 491 Listeria monocytogenes
57 489 Staphylococcus epidermidis
58 488 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
59 486 Mycoplasma genitalium
60 486 Listeria innocua
61 481 Rhizobium loti (Mesorhizobium loti)
62 477 Xanthomonas campestris (pv. campestris)
63 475 Neisseria meningitidis (serogroup B)
64 473 Neisseria meningitidis (serogroup A)
65 465 Clostridium acetobutylicum
66 461 Caulobacter crescentus
67 460 Bradyrhizobium japonicum
68 457 Thermotoga maritima
69 456 Bacillus anthracis
70 445 Canis familiaris (Dog)
71 439 Xanthomonas axonopodis (pv. citri)
72 436 Xylella fastidiosa
73 431 Streptococcus pneumoniae
74 430 Deinococcus radiodurans
75 430 Oryza sativa (Rice)
76 424 Pyrococcus horikoshii
77 424 Xylella fastidiosa (strain Temecula1 / ATCC 700964)
78 421 Chlamydia trachomatis
79 420 Pyrococcus abyssi
80 417 Borrelia burgdorferi (Lyme disease spirochete)
81 411 Shewanella oneidensis
82 409 Chlamydia pneumoniae (Chlamydophila pneumoniae)
83 408 Brucella melitensis
84 407 Brucella suis
85 405 Clostridium perfringens
86 403 Rhizobium sp. (strain NGR234)
87 399 Methanosarcina acetivorans
88 399 Chlamydia muridarum
89 395 Corynebacterium glutamicum (Brevibacterium flavum)
90 389 Halobacterium sp. (strain NRC-1 / ATCC 700922 / JCM 11081)
91 386 Bacillus cereus (strain ATCC 14579 / DSM 31)
92 386 Methanosarcina mazei (Methanosarcina frisia)
93 380 Pyrococcus furiosus
94 378 Campylobacter jejuni
95 375 Sulfolobus solfataricus
96 371 Thermoanaerobacter tengcongensis
97 368 Oceanobacillus iheyensis
98 365 Neurospora crassa
99 364 Lactobacillus plantarum
100 361 Streptococcus pyogenes
101 361 Nicotiana tabacum (Common tobacco)
102 360 Ovis aries (Sheep)
103 359 Rickettsia conorii
104 353 Vibrio vulnificus (strain YJ016)
105 350 Streptococcus pneumoniae (strain ATCC BAA-255 / R6)
106 349 Photorhabdus luminescens (subsp. laumondii)
107 347 Synechococcus elongatus (Thermosynechococcus elongatus)
108 340 Brachydanio rerio (Zebrafish) (Danio rerio)
109 337 Streptococcus mutans
110 332 Aeropyrum pernix
111 329 Chlorobium tepidum
112 323 Dictyostelium discoideum (Slime mold)
113 317 Streptococcus pyogenes (serotype M18)
114 312 Streptococcus pyogenes (serotype M3)
115 312 Staphylococcus aureus
116 309 Methanopyrus kandleri
117 305 Candida albicans (Yeast)
118 302 Pisum sativum (Garden pea)
119 301 Sulfolobus tokodaii
120 299 Enterococcus faecalis (Streptococcus faecalis)
121 287 Thermoplasma acidophilum
122 282 Corynebacterium efficiens
123 282 Triticum aestivum (Wheat)
124 280 Bordetella pertussis
125 278 Haemophilus ducreyi
126 277 Bordetella bronchiseptica (Alcaligenes bronchisepticus)
127 270 Hordeum vulgare (Barley)
128 269 Streptomyces avermitilis
129 268 Fusobacterium nucleatum (subsp. nucleatum)
130 268 Bacteriophage T4
131 266 Bordetella parapertussis
132 263 Chromobacterium violaceum
133 263 Nitrosomonas europaea
134 263 Glycine max (Soybean)
135 257 Lycopersicon esculentum (Tomato)
136 256 Cavia porcellus (Guinea pig)
137 255 Streptococcus agalactiae (serotype V)
138 254 Vaccinia virus (strain Copenhagen)
139 253 Rhodobacter capsulatus (Rhodopseudomonas capsulata)
140 253 Pyrobaculum aerophilum
141 253 Thermoplasma volcanium
142 253 Streptococcus agalactiae (serotype III)
143 252 Solanum tuberosum (Potato)
144 252 Leptospira interrogans
145 249 Pan troglodytes (Chimpanzee)
146 249 Pseudomonas putida
147 238 Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
148 237 Spinacia oleracea (Spinach)
149 232 Bacillus stearothermophilus
150 221 Wigglesworthia glossinidia brevipalpis
151 220 Porphyra purpurea
152 218 Chlamydophila caviae
153 215 Clostridium tetani
154 214 Coxiella burnetii
155 212 Synechococcus sp. (strain WH8102)
156 212 Chlamydomonas reinhardtii
157 207 Gloeobacter violaceus
158 207 Bacteroides thetaiotaomicron
159 206 Equus caballus (Horse)
160 206 Prochlorococcus marinus
161 204 Klebsiella pneumoniae
162 203 Prochlorococcus marinus (strain MIT 9313)
163 201 Macaca fascicularis (Crab eating macaque) (Cynomolgus monkey)
164 200 Kluyveromyces lactis (Yeast)
3.3 Taxonomic distribution of the sequences
Kingdom sequences (% of the database)
Archaea 8886 ( 5%)
Bacteria 71350 ( 44%)
Eukaryota 74328 ( 46%)
Viruses 8671 ( 5%)
Within Eukaryota:
Category sequences (% of Eukaryota) (% of the complete database)
Human 11539 ( 16%) ( 7%)
Other Mammalia 20961 ( 28%) ( 13%)
Other Vertebrata 6796 ( 9%) ( 4%)
Viridiplantae 11474 ( 15%) ( 7%)
Fungi 11135 ( 15%) ( 7%)
Insecta 4073 ( 5%) ( 2%)
Nematoda 2792 ( 4%) ( 2%)
Other 5558 ( 7%) ( 3%)
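The two percentage columns of the Eukaryota breakdown can be recomputed from the raw counts. The sketch below assumes the published figures are rounded to the nearest whole percent; that rounding rule is an assumption inferred from the numbers, not something the release notes state.

```python
TOTAL_ENTRIES = 163235  # total number of entries, from the introduction

# Counts copied from the "Within Eukaryota" table.
eukaryota = {
    "Human": 11539, "Other Mammalia": 20961, "Other Vertebrata": 6796,
    "Viridiplantae": 11474, "Fungi": 11135, "Insecta": 4073,
    "Nematoda": 2792, "Other": 5558,
}
eukaryota_total = sum(eukaryota.values())  # should match the Eukaryota row

shares = {}
for category, count in eukaryota.items():
    of_eukaryota = round(100 * count / eukaryota_total)
    of_database = round(100 * count / TOTAL_ENTRIES)
    shares[category] = (of_eukaryota, of_database)
    print(f"{category:<17}{count:>7} ({of_eukaryota:>3}%) ({of_database:>3}%)")
```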
4. SEQUENCE SIZE
Distribution of the sequences by size (excluding fragments)
From To Number From To Number
1- 50 3089 1001-1100 1382
51- 100 11502 1101-1200 1000
101- 150 16542 1201-1300 723
151- 200 15548 1301-1400 539
201- 250 16096 1401-1500 424
251- 300 13706 1501-1600 272
301- 350 14454 1601-1700 206
351- 400 12990 1701-1800 143
401- 450 9990 1801-1900 162
451- 500 8487 1901-2000 129
501- 550 6464 2001-2100 80
551- 600 4396 2101-2200 125
601- 650 3735 2201-2300 111
651- 700 2616 2301-2400 71
701- 750 2206 2401-2500 63
751- 800 1864 >2500 435
801- 850 1486
851- 900 1662
901- 950 1135
951-1000 954
The average sequence length in UniProt/Swiss-Prot is 365 amino acids.
The shortest sequence is GWA_SEPOF (P83570): 2 amino acids.
The longest sequence is SNE1_HUMAN (Q8NF91): 8797 amino acids.
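The average length quoted above is consistent with dividing the total number of amino acids by the total number of entries, both given in the introduction. The sketch below assumes that this is how the figure is derived, with fragments included in both totals.

```python
TOTAL_AMINO_ACIDS = 59631787  # from the introduction
TOTAL_ENTRIES = 163235        # from the introduction

# Mean number of amino acids per entry.
average_length = TOTAL_AMINO_ACIDS / TOTAL_ENTRIES
print(f"Average sequence length: {average_length:.0f} amino acids")
```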
5. JOURNAL CITATIONS
Note: the following citation statistics reflect the number of distinct
journal citations.
Total number of journals cited in this release of UniProt/Swiss-Prot: 1516
5.1 Table of the frequency of journal citations
Journals cited 1x: 556
2x: 203
3x: 105
4x: 63
5x: 66
6x: 33
7x: 34
8x: 23
9x: 26
10x: 15
11- 20x: 120
21- 50x: 113
51-100x: 53
>100x: 106
5.2 List of the most cited journals in UniProt/Swiss-Prot
Nb Citations Journal name
-- --------- -------------------------------------------------------------
1 10930 Journal of Biological Chemistry
2 5666 Proceedings of the National Academy of Sciences of the U.S.A.
3 3967 Journal of Bacteriology
4 3761 Nucleic Acids Research
5 3697 Gene
6 3004 Biochemical and Biophysical Research Communications
7 2997 FEBS Letters
8 2710 Biochemistry
9 2655 European Journal of Biochemistry
10 2516 The EMBO Journal
11 2342 Nature
12 2271 Biochimica et Biophysica Acta
13 2061 Journal of Molecular Biology
14 1977 Genomics
15 1856 Cell
16 1839 Molecular and Cellular Biology
17 1447 Biochemical Journal
18 1365 Science
19 1223 Molecular Microbiology
20 1183 Plant Molecular Biology
21 1181 Molecular and General Genetics
22 944 Journal of Biochemistry
23 895 Human Molecular Genetics
24 893 Virology
25 886 Journal of Cell Biology
26 817 Nature Genetics
27 733 Genes and Development
28 710 Journal of Virology
29 702 The American Journal of Human Genetics
30 670 Oncogene
31 667 Plant Physiology
32 654 Human Mutation
33 603 Journal of Immunology
34 592 Yeast
35 590 Infection and Immunity
36 564 Structure
37 544 Archives of Biochemistry and Biophysics
38 535 Journal of General Virology
39 519 Microbiology
40 517 Development
41 500 FEMS Microbiology Letters
42 470 Nature Structural Biology
43 467 Genetics
44 432 Human Genetics
45 423 Current Genetics
46 416 Blood
47 379 Molecular and Biochemical Parasitology
48 366 Applied and Environmental Microbiology
49 346 Journal of Clinical Investigation
50 334 Developmental Biology
51 333 Mammalian Genome
52 333 Protein Science
53 329 Molecular Endocrinology
54 322 Cancer Research
55 317 Molecular Biology of the Cell
56 310 Immunogenetics
57 308 Journal of Molecular Evolution
58 308 Neuron
59 304 DNA and Cell Biology
60 304 Mechanisms of Development
61 304 Acta Crystallographica, Section D
62 298 The Journal of Experimental Medicine
63 291 Journal of Cell Science
64 289 The Plant Cell
65 275 Biological Chemistry Hoppe-Seyler
66 267 Endocrinology
67 261 DNA Sequence
68 257 Journal of Neuroscience
69 254 The Plant Journal
70 236 Journal of General Microbiology
71 232 Journal of Neurochemistry
72 231 Molecular Biology and Evolution
73 230 The Journal of Clinical Endocrinology and Metabolism
74 228 Brain Research. Molecular Brain Research
75 216 Molecular Cell
76 214 Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
77 212 Toxicon
78 205 Cytogenetics and Cell Genetics
79 199 American Journal of Physiology
80 198 Comparative Biochemistry and Physiology
81 196 Current Biology
82 194 Bioscience, Biotechnology, and Biochemistry
83 176 Antimicrobial Agents and Chemotherapy
84 176 Molecular Pharmacology
85 159 Proteins
86 156 DNA
87 149 Journal of Investigative Dermatology
88 147 Journal of Medical Genetics
89 146 DNA Research
90 146 Peptides
91 146 Tissue Antigens
92 141 Molecular Plant-Microbe Interactions
93 141 Virus Research
94 140 Biochimie
95 138 Genome Research
96 138 American Journal of Medical Genetics
97 134 Bioorganicheskaia Khimiia
98 126 European Journal of Immunology
99 123 Molecular and Cellular Endocrinology
100 123 Hemoglobin
101 121 Plant and Cell Physiology
102 117 Biology of Reproduction
103 115 Agricultural and Biological Chemistry
104 112 Insect Biochemistry and Molecular Biology
105 106 Archives of Microbiology
106 102 General and Comparative Endocrinology
107 100 Diabetes
6. STATISTICS FOR SOME LINE TYPES
The following table summarizes the total number of some UniProt/Swiss-Prot
lines, as well as the number of entries with at least one such line, and the
frequency of the lines.
Total Number of Average
Line type / subtype number entries per entry
--------------------------------- -------- --------- ---------
References (RL) 316266 1.94
Journal 280711 153830 1.72
Submitted to EMBL/GenBank/DDBJ 32597 28003 0.20
Submitted to Swiss-Prot 776 771 <0.01
Unpublished observations 495 491 <0.01
Plant Gene Register 489 478 <0.01
Book citation 478 466 <0.01
Thesis 274 272 <0.01
Submitted to other databases 201 200 <0.01
Unpublished results 128 126 <0.01
Patent 114 112 <0.01
Worm Breeder's Gazette 3 3 <0.01
Comments (CC) 582509 3.57
SIMILARITY 166945 143752 1.02
FUNCTION 106598 104154 0.65
SUBCELLULAR LOCATION 78760 78760 0.48
CATALYTIC ACTIVITY 57355 53901 0.35
SUBUNIT 51150 51150 0.31
PATHWAY 27659 26444 0.17
COFACTOR 19319 19319 0.12
TISSUE SPECIFICITY 18041 18041 0.11
PTM 10656 9454 0.07
MISCELLANEOUS 9854 9021 0.06
DOMAIN 6436 5646 0.04
ALTERNATIVE PRODUCTS 6148 6148 0.04
CAUTION 5318 4858 0.03
INDUCTION 4483 4483 0.03
DEVELOPMENTAL STAGE 4214 4214 0.03
DISEASE 2765 2034 0.02
ENZYME REGULATION 2308 2308 0.01
MASS SPECTROMETRY 1501 1322 0.01
DATABASE 1443 1361 0.01
POLYMORPHISM 491 479 <0.01
ALLERGEN 366 366 <0.01
RNA EDITING 316 316 <0.01
TOXIC DOSE 244 242 <0.01
BIOTECHNOLOGY 89 89 <0.01
PHARMACEUTICAL 50 50 <0.01
Features (FT) 917536 5.62
DOMAIN 132159 41072 0.81
TRANSMEM 103438 22456 0.63
TURN 62434 4661 0.38
METAL 61199 15293 0.37
CONFLICT 61029 21454 0.37
STRAND 57250 4165 0.35
CARBOHYD 54750 13451 0.34
DISULFID 51096 13514 0.31
HELIX 45067 4518 0.28
ACT_SITE 35908 21679 0.22
REPEAT 35282 4995 0.22
VARIANT 29737 5516 0.18
CHAIN 28344 22968 0.17
NP_BIND 22707 15604 0.14
SIGNAL 17660 17658 0.11
MOD_RES 17515 9614 0.11
BINDING 13576 9401 0.08
SITE 13553 8092 0.08
VARSPLIC 12101 5353 0.07
NON_TER 10873 8300 0.07
ZN_FING 10322 3821 0.06
MUTAGEN 8574 2356 0.05
INIT_MET 7129 7083 0.04
PROPEP 5683 4814 0.03
LIPID 5008 3289 0.03
DNA_BIND 4983 4681 0.03
TRANSIT 3020 2995 0.02
PEPTIDE 2983 1241 0.02
CA_BIND 2178 896 0.01
NON_CONS 925 459 0.01
CROSSLNK 494 389 <0.01
UNSURE 373 153 <0.01
SE_CYS 186 129 <0.01
Cross-references (DR) 1573986 9.64
InterPro 332339 147362 2.04
EMBL 313738 155986 1.92
Pfam 190733 140008 1.17
PROSITE 144507 90826 0.89
PIR 91028 83972 0.56
HSSP 68288 68288 0.42
PRINTS 58993 48028 0.36
GO 54709 16394 0.34
TIGRFAMs 54414 47723 0.33
HAMAP 48541 48434 0.30
ProDom 43929 42102 0.27
SMART 39027 29738 0.24
PDB 24640 6662 0.15
TIGR 16273 15819 0.10
Genew 10611 10554 0.07
MIM 10078 8331 0.06
MGD 8016 7978 0.05
SGD 5041 4981 0.03
GermOnline 4927 4876 0.03
PIRSF 4793 4793 0.03
EcoGene 4228 4226 0.03
EchoBASE 4159 4127 0.03
MEROPS 3989 3889 0.02
H-InvDB 3677 3659 0.02
WormPep 2876 2535 0.02
RGD 2782 2780 0.02
SubtiList 2702 2701 0.02
FlyBase 2701 2655 0.02
GeneDB_SPombe 2700 2670 0.02
TRANSFAC 2691 2412 0.02
IntAct 2549 2549 0.02
WormBase 2488 2426 0.02
TubercuList 1427 1391 0.01
StyGene 1407 1404 0.01
SWISS-2DPAGE 1113 1113 0.01
ListiList 978 955 0.01
Reactome 712 712 <0.01
Leproma 613 609 <0.01
Gramene 562 557 <0.01
GeneFarm 500 499 <0.01
MaizeDB 412 407 <0.01
HIV 370 354 <0.01
REBASE 365 360 <0.01
OGP 358 358 <0.01
ECO2DBASE 351 299 <0.01
PhotoList 349 349 <0.01
DictyBase 324 322 <0.01
ZFIN 307 300 <0.01
GlycoSuiteDB 262 262 <0.01
SagaList 254 253 <0.01
PHCI-2DPAGE 239 239 <0.01
AGD 187 182 <0.01
MypuList 168 168 <0.01
Aarhus/Ghent-2DPAGE 128 98 <0.01
Siena-2DPAGE 103 103 <0.01
HSC-2DPAGE 85 85 <0.01
COMPLUYEAST-2DPAGE 59 59 <0.01
PhosSite 54 54 <0.01
PMMA-2DPAGE 52 52 <0.01
Maize-2DPAGE 39 39 <0.01
Rat-heart-2DPAGE 28 28 <0.01
ANU-2DPAGE 13 13 <0.01
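The "Average per entry" column in the table above appears to be the total number of lines divided by the total number of entries in the release (163235), not by the number of entries carrying such a line. The sketch below rests on that assumption, and on the further assumption that values rounding to 0.00 are printed as "<0.01"; the function names are illustrative.

```python
TOTAL_ENTRIES = 163235  # total number of entries, from the introduction


def average_per_entry(total_lines: int, total_entries: int = TOTAL_ENTRIES) -> str:
    """Format lines per database entry the way the table does."""
    value = f"{total_lines / total_entries:.2f}"
    # Assumption: anything that rounds to 0.00 is shown as "<0.01".
    return "<0.01" if value == "0.00" else value


def lines_per_annotated_entry(total_lines: int, entries_with_line: int) -> float:
    """Mean number of lines among entries that have at least one such line."""
    return total_lines / entries_with_line


print(average_per_entry(316266))   # References (RL)
print(average_per_entry(166945))   # SIMILARITY comments
print(average_per_entry(776))      # Submitted to Swiss-Prot
print(f"{lines_per_annotated_entry(166945, 143752):.2f}")
```

The second function gives a complementary view that the table does not print: how many lines of a given type an entry has on average once it has any at all.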
7. MISCELLANEOUS STATISTICS
Total number of distinct authors cited in UniProt/Swiss-Prot: 191089
Total number of entries encoded on a chloroplast: 3657
Total number of entries encoded on a mitochondrion: 2947
Total number of entries encoded on a cyanelle: 145
Total number of entries encoded on a plasmid: 2817
Number of fragments: 8448
Number of additional sequences encoded on splice variants: 9436
| UniProt/TrEMBL protein database release 28.0 statistics |
|---|
1. INTRODUCTION
Release 28.0 of 25-Oct-2004 of UniProt/TrEMBL has been produced in sync
with UniProt/Swiss-Prot release 45 and EMBL/DDBJ/GenBank nucleotide
sequence database release 80, plus updates up to 24 September. It contains 1'449'374
sequence entries, comprising 452'535'149 amino acids.
126'364 sequences have been added since release 27, and the sequence and annotation
data of 56'945 entries have been revised. This represents an increase of 10.31%.
In the document delac_tr.txt, you will find a list of all accession numbers
which were previously present in UniProt/TrEMBL, but which have now been
deleted from the database. Most deletions are due to the deletion of the
corresponding CDS in the source nucleotide sequence databases
EMBL-Bank/DDBJ/GenBank. In addition, some entries are recognised to be Open
Reading Frames (ORFs) that have been wrongly predicted to code for proteins.
When there is enough evidence that these hypothetical proteins are not real,
we remove them from TrEMBL.
2. AMINO ACID COMPOSITION
2.1 Composition in percent for the complete database
Ala (A) 7.75 Gln (Q) 3.86 Leu (L) 9.73 Ser (S) 7.07
Arg (R) 5.30 Glu (E) 6.05 Lys (K) 5.57 Thr (T) 5.74
Asn (N) 4.49 Gly (G) 6.92 Met (M) 2.41 Trp (W) 1.37
Asp (D) 5.09 His (H) 2.27 Phe (F) 4.14 Tyr (Y) 3.15
Cys (C) 1.50 Ile (I) 6.03 Pro (P) 4.92 Val (V) 6.48
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.07
2.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Ile, Thr, Lys, Arg, Asp, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
3. TAXONOMIC ORIGIN
Total number of species represented in this release of TrEMBL: 79556
The first twenty species represent 443604 sequences: 30.6 % of the
total number of entries.
3.1 Table of the frequency of occurrence of species
Species represented 1x:39735
2x:14935
3x: 7591
4x: 3995
5x: 2237
6x: 1761
7x: 1190
8x: 1034
9x: 803
10x: 607
11- 20x: 2592
21- 50x: 1609
51-100x: 651
>100x: 816
3.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 111043 Human immunodeficiency virus 1
2 45936 Homo sapiens (Human)
3 43435 Oryza sativa (japonica cultivar-group)
4 38180 Arabidopsis thaliana (Mouse-ear cress)
5 37472 Mus musculus (Mouse)
6 23882 Drosophila melanogaster (Fruit fly)
7 20025 Caenorhabditis elegans
8 19828 Hepatitis C virus
9 15632 Anopheles gambiae str. PEST
10 10995 Neurospora crassa
11 9060 Xenopus laevis (African clawed frog)
12 8837 Brachydanio rerio (Zebrafish) (Danio rerio)
13 8183 Bradyrhizobium japonicum
14 7811 Plasmodium yoelii yoelii
15 7588 Streptomyces coelicolor
16 7438 Streptomyces avermitilis
17 7199 Rhizobium loti (Mesorhizobium loti)
18 7102 Rhodopirellula baltica
19 7021 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
20 6937 Rattus norvegicus (Rat)
21 6488 Hepatitis B virus
22 6414 Yarrowia lipolytica CLIB99
23 6397 Giardia lamblia ATCC 50803
24 6354 Pseudomonas aeruginosa
25 6322 Bacillus anthracis
26 6249 Debaryomyces hansenii CBS767
27 5879 Escherichia coli
28 5707 Burkholderia pseudomallei K96243
29 5701 Bacillus cereus (strain ATCC 10987)
30 5685 Rhizobium meliloti (Sinorhizobium meliloti)
31 5575 Anabaena sp. (strain PCC 7120)
32 5275 Photobacterium profundum (Photobacterium sp. (strain SS9))
33 5269 Yersinia pestis
34 5231 Plasmodium falciparum (isolate 3D7)
35 5133 Kluyveromyces lactis NRRL Y-1140
36 5128 Bacillus cereus ZK
37 5062 Bacillus thuringiensis (subsp. konkukian)
38 5029 Candida glabrata CBS138
39 5000 Pseudomonas syringae (pv. tomato)
40 4995 uncultured bacterium
41 4979 Bacillus licheniformis DSM 13
42 4964 Escherichia coli O157:H7
43 4936 Helicobacter pylori (Campylobacter pylori)
44 4862 Bordetella bronchiseptica (Alcaligenes bronchisepticus)
45 4860 Bacteroides fragilis
46 4855 Bacillus cereus (strain ATCC 14579 / DSM 31)
47 4807 Pseudomonas putida (strain KT2440)
48 4707 Rhodopseudomonas palustris
49 4698 Ralstonia solanacearum (Pseudomonas solanacearum)
50 4637 Bacteroides thetaiotaomicron
51 4637 Vibrio vulnificus (strain YJ016)
52 4629 Leptospira interrogans
53 4567 Ashbya gossypii (Yeast) (Eremothecium gossypii)
54 4523 Burkholderia mallei ATCC 23344
55 4473 Erwinia carotovora (subsp. atroseptica) (Pectobacterium atrosepticum)
56 4423 Shigella flexneri
57 4402 Vibrio parahaemolyticus
58 4307 Mycobacterium paratuberculosis
59 4270 Mycobacterium tuberculosis
60 4209 Gloeobacter violaceus
61 4176 Shewanella oneidensis
62 4171 Photorhabdus luminescens (subsp. laumondii)
63 4143 Chromobacterium violaceum
64 4080 Methanosarcina acetivorans
65 4076 Salmonella typhi
66 4034 Vibrio vulnificus
67 3998 Yersinia pseudotuberculosis IP 32953
68 3997 Escherichia coli O6
69 3925 Xanthomonas axonopodis (pv. citri)
70 3920 Vibrio cholerae
71 3910 Bordetella parapertussis
72 3871 Oryza sativa (Rice)
73 3829 Corynebacterium glutamicum (Brevibacterium flavum)
74 3807 Plasmodium falciparum
75 3755 Listeria monocytogenes
76 3721 Xanthomonas campestris (pv. campestris)
77 3613 Salmonella typhimurium
78 3577 Leptospira interrogans (serogroup Icterohaemorrhagiae / serovar Copenhageni)
79 3577 Bacillus halodurans
80 3560 Enterococcus faecalis (Streptococcus faecalis)
81 3543 Bdellovibrio bacteriovorus
82 3438 TT virus
83 3422 Streptococcus pneumoniae
84 3421 Clostridium acetobutylicum
85 3412 Desulfovibrio vulgaris (strain Hildenborough / ATCC 29579 / NCIMB 8303)
86 3327 Caulobacter crescentus
87 3312 Symbiobacterium thermophilum
88 3309 Geobacter sulfurreducens
89 3259 Acinetobacter sp. (strain ADP1)
90 3225 Desulfotalea psychrophila
91 3204 Dictyostelium discoideum (Slime mold)
92 3125 Oceanobacillus iheyensis
93 3117 Chimpanzee immunodeficiency virus (SIV(cpz)) (CIV)
94 3092 Streptococcus pyogenes
95 3080 Bordetella pertussis
96 2971 Methanosarcina mazei (Methanosarcina frisia)
97 2873 Mycobacterium bovis
98 2863 Brucella suis
99 2841 Lactobacillus plantarum
100 2826 Gallus gallus (Chicken)
3.3 Distribution of the sequences by sections
Division sequences (% of the database)
arc 4947 ( 0%)
arp 33768 ( 2%)
fun 60361 ( 4%)
hum 42112 ( 3%)
inv 130570 ( 9%)
mam 18122 ( 1%)
mhc 11167 ( 1%)
org 122210 ( 8%)
phg 14152 ( 1%)
pln 127263 ( 9%)
pro 167218 (12%)
prp 374823 (26%)
rod 47880 ( 3%)
unc 1035 ( 0%)
vrl 133061 ( 9%)
vrt 38587 ( 3%)
vrv 122098 ( 8%)
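As a consistency check, the division counts above can be summed and compared with the total number of TrEMBL entries given in the introduction. The sketch below also recomputes the percentages, assuming they are rounded to the nearest whole percent; the division codes are copied as they appear in the table.

```python
TREMBL_ENTRIES = 1449374  # total number of entries, from the introduction

# Counts copied from table 3.3.
divisions = {
    "arc": 4947, "arp": 33768, "fun": 60361, "hum": 42112, "inv": 130570,
    "mam": 18122, "mhc": 11167, "org": 122210, "phg": 14152, "pln": 127263,
    "pro": 167218, "prp": 374823, "rod": 47880, "unc": 1035, "vrl": 133061,
    "vrt": 38587, "vrv": 122098,
}

division_total = sum(divisions.values())
percentages = {
    code: round(100 * count / TREMBL_ENTRIES)
    for code, count in divisions.items()
}
print(division_total == TREMBL_ENTRIES, percentages["prp"])
```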
4. SEQUENCE SIZE
4.1 Distribution of the sequences by size (excluding fragments)
From To Number From To Number
1- 50 17511 1001-1100 7990
51- 100 86504 1101-1200 5679
101- 150 106065 1201-1300 4310
151- 200 97642 1301-1400 2774
201- 250 99121 1401-1500 2311
251- 300 91903 1501-1600 1589
301- 350 89727 1601-1700 1246
351- 400 72463 1701-1800 1113
401- 450 55700 1801-1900 883
451- 500 48422 1901-2000 732
501- 550 38437 2001-2100 568
551- 600 26652 2101-2200 668
601- 650 20340 2201-2300 577
651- 700 16042 2301-2400 456
701- 750 13627 2401-2500 288
751- 800 11127 >2500 2819
801- 850 9494
851- 900 8340
901- 950 6039
951-1000 4987
4.2 Longest and shortest sequences
The shortest sequence is Q16047: 4 amino acids.
The longest sequence is Q8WZ42: 34350 amino acids.
5. STATISTICS FOR SOME LINE TYPES
The following table summarizes the total number of some TrEMBL lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.
Total Number of Average
Line type / subtype number entries per entry
--------------------------------- -------- --------- ---------
References (RL) 2070966 1.43
Journal 1310858 1081957 0.90
Submitted to EMBL/GenBank/DDBJ 752316 583823 0.52
Thesis 4475 4423 <0.01
Book citation 2836 2798 <0.01
Submitted to other databases 465 457 <0.01
Unpublished results 10 10 <0.01
Unpublished observations 4 4 <0.01
Plant Gene Register 1 1 <0.01
Patent 1 1 <0.01
Comments (CC) 740803 0.51
SIMILARITY 287840 284120 0.20
FUNCTION 104641 104601 0.07
CATALYTIC ACTIVITY 92792 81468 0.06
SUBCELLULAR LOCATION 84858 84816 0.06
SUBUNIT 57056 57045 0.04
CAUTION 41764 41763 0.03
COFACTOR 38459 38219 0.03
PATHWAY 27807 27807 0.02
MISCELLANEOUS 4489 4472 <0.01
DOMAIN 336 331 <0.01
PTM 313 312 <0.01
TISSUE SPECIFICITY 156 156 <0.01
MASS SPECTROMETRY 122 66 <0.01
DEVELOPMENTAL STAGE 58 58 <0.01
INDUCTION 49 49 <0.01
ALTERNATIVE PRODUCTS 45 45 <0.01
ENZYME REGULATION 10 10 <0.01
DISEASE 5 5 <0.01
POLYMORPHISM 3 3 <0.01
Features (FT) 889215 0.61
NON_TER 834577 492963 0.58
CHAIN 38254 22864 0.03
SIGNAL 12200 11990 0.01
NON_CONS 949 435 <0.01
CARBOHYD 590 108 <0.01
TRANSIT 588 578 <0.01
DOMAIN 582 185 <0.01
SE_CYS 301 159 <0.01
TRANSMEM 253 53 <0.01
CONFLICT 177 32 <0.01
REPEAT 173 24 <0.01
DISULFID 103 36 <0.01
VARSPLIC 90 38 <0.01
METAL 65 26 <0.01
VARIANT 53 13 <0.01
ACT_SITE 47 33 <0.01
UNSURE 33 14 <0.01
DNA_BIND 30 24 <0.01
NP_BIND 29 25 <0.01
MOD_RES 27 16 <0.01
BINDING 19 10 <0.01
ZN_FING 18 10 <0.01
PROPEP 15 12 <0.01
SITE 14 11 <0.01
MUTAGEN 10 4 <0.01
LIPID 7 4 <0.01
CA_BIND 5 4 <0.01
PEPTIDE 4 4 <0.01
INIT_MET 2 2 <0.01
Cross-references (DR) 10281511 7.09
GO 3015214 891190 2.08
InterPro 2018619 1017975 1.39
EMBL 1683762 1442914 1.16
Pfam 1301521 990543 0.90
PROSITE 658935 435811 0.45
HSSP 301450 301172 0.21
PRINTS 300132 249418 0.21
SMART 241092 187553 0.17
PIR 199166 163352 0.14
ProDom 175640 168699 0.12
TIGRFAMs 138646 128847 0.10
TIGR 76496 70728 0.05
MGD 26164 26162 0.02
Gramene 25322 24654 0.02
FlyBase 23194 22921 0.02
WormPep 19048 18955 0.01
WormBase 19021 18947 0.01
PIRSF 9470 9460 0.01
MEROPS 6446 6195 <0.01
ZFIN 6323 6320 <0.01
IntAct 5195 5195 <0.01
ListiList 4836 4819 <0.01
AGD 4503 4503 <0.01
PhotoList 4332 4208 <0.01
TubercuList 2500 2491 <0.01
PDB 2389 1375 <0.01
Genew 2310 2310 <0.01
GeneDB_SPombe 2248 2233 <0.01
SagaList 1840 1746 <0.01
SGD 1520 1520 <0.01
TRANSFAC 1091 1077 <0.01
Leproma 993 991 <0.01
DictyBase 950 950 <0.01
MypuList 614 610 <0.01
REBASE 126 121 <0.01
PHCI-2DPAGE 108 108 <0.01
SWISS-2DPAGE 106 106 <0.01
ANU-2DPAGE 76 76 <0.01
OGP 41 40 <0.01
Reactome 39 39 <0.01
MIM 14 13 <0.01
PhosSite 12 12 <0.01
PMMA-2DPAGE 3 3 <0.01
Siena-2DPAGE 2 2 <0.01
RGD 1 1 <0.01
COMPLUYEAST-2DPAGE 1 1 <0.01
6. MISCELLANEOUS STATISTICS
Total number of distinct authors cited in TrEMBL: 202049
Total number of entries encoded on a chloroplast: 36258
Total number of entries encoded on a mitochondrion: 85929
Total number of entries encoded on a plasmid: 29227
Number of additional sequences encoded on splice variants: 66
| Submissions and Updates |
|---|
We welcome feedback from our users. We would especially appreciate being notified if you find that sequences belonging to your field of expertise are missing from the database. We would also like to be notified about annotations that need updating, for example if the function of a protein has been clarified or if new information about post-translational modifications has become available.
Submit new sequence data, updates and corrections at http://www.uniprot.org/support/submissions.shtml
For all queries regarding submissions to UniProt and to submit new protein sequence data, please contact:
UniProt Knowledgebase
The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Telephone: (+44 1223) 494 462
Telefax: (+44 1223) 494 468
E-mail:
| Download information |
|---|
For users who wish to download the UniProt Knowledgebase only occasionally, we distribute the latest full release (updated 4 times per year) in flatfile format. The UniProt/Swiss-Prot Protein Knowledgebase is available at ftp://ftp.expasy.org/databases/swiss-prot/ and the UniProt/TrEMBL Protein Database is available at ftp://ftp.ebi.ac.uk/pub/databases/trembl/.
The UniProt Knowledgebase full release is also available on CD-ROM from the EBI.
The latest data of the UniProt Knowledgebase is available in various formats (flatfile, XML or FASTA) at http://www.uniprot.org/database/download.shtml. The data is further supplemented by two files containing the sequences of all additional splice isoforms annotated in UniProt/Swiss-Prot and UniProt/TrEMBL. These data sets are documented in the file ftp://ftp.expasy.org/databases/sp_tr_nrdb/varsplic.txt
| Contact |
|---|
| Citation |
|---|
If you want to cite UniProt in a publication please use the following reference:
Apweiler R., Bairoch A., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O'Donovan C., Redaschi N. and Yeh L.L., UniProt: the Universal Protein Knowledgebase, Nucleic Acids Res. 32: D115-D119 (2004).
| Copyright |
|---|
UniProt copyright (c) 2003 UniProt consortium
For non-commercial use, all databases and documents in the UniProt FTP directory may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy.
For commercial use, all databases and documents in the UniProt FTP directory, except the files ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz and ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.xml.gz, may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy. More information for commercial users can be found at http://www.expasy.org/announce/sp_98.html
From January 1, 2005, all databases and documents in the UniProt FTP directory may be copied and redistributed freely by all entities, without advance permission, provided that this copyright statement is reproduced with each copy.
The above copyright notice also applies to these release notes as well as to all other UniProt Knowledgebase documents.