SWISS-PROT RELEASE 8.0 RELEASE NOTES
Date: August 4, 1988
Author: A. Bairoch
1. INTRODUCTION
1.1 Evolution
Release 8.0 of SWISS-PROT contains 7724 sequence entries,
comprising 2'224'465 amino acids abstracted from 8088
references. This represents an increase of 13% over release
7.0. The recent growth of the data bank is summarised
below:
Release   Date    Number of entries   Number of amino acids
3.0       11/86   4160                  969 641
4.0       04/87   4387                1 036 010
5.0       09/87   5205                1 327 683
6.0       01/88   6102                1 653 982
7.0       04/88   6821                1 885 771
8.0       08/88   7724                2 224 465
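The growth figures can be checked with a little arithmetic. The Python sketch below is illustrative only: the names RELEASES, percent_increase and growth_table are our own assumptions, and the figures are copied from the table above.

```python
# Release figures copied from the table in section 1.1:
# (release, date, number of entries, number of amino acids)
RELEASES = [
    ("3.0", "11/86", 4160, 969641),
    ("4.0", "04/87", 4387, 1036010),
    ("5.0", "09/87", 5205, 1327683),
    ("6.0", "01/88", 6102, 1653982),
    ("7.0", "04/88", 6821, 1885771),
    ("8.0", "08/88", 7724, 2224465),
]


def percent_increase(old, new):
    """Return the relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old


def growth_table(releases):
    """Pair each release with its growth over the preceding release."""
    rows = []
    for previous, current in zip(releases, releases[1:]):
        rows.append((
            current[0],
            percent_increase(previous[2], current[2]),  # entries
            percent_increase(previous[3], current[3]),  # amino acids
        ))
    return rows


for release, entries_pct, residues_pct in growth_table(RELEASES):
    print(f"{release}: entries +{entries_pct:.1f}%, "
          f"amino acids +{residues_pct:.1f}%")
```

For release 8.0 the number of entries grows by about 13%, which is the figure quoted in the text; the number of amino acids grows somewhat faster.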
1.2 Source of data
Release 8.0 has been updated using protein sequence data
from release 16.0 of the PIR (Protein Identification
Resource) protein data bank, as well as translation of
nucleotide sequence data from release 15.0 of the EMBL
nucleotide sequence Data Library.
As an indication of the source of the sequence data in the
SWISS-PROT data bank, we list here the statistics concerning
the DR (Databank Reference) pointer lines:
Entries with pointer(s) to PIR entries only:            3385
Entries with pointer(s) to EMBL entries only:           2084
Entries with pointer(s) to both EMBL and PIR entries:   2057
Entries with no pointer lines (entered in house):        198
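The classification behind these statistics could be sketched as follows. This is not the program used to produce the figures. It assumes, and these notes do not show this, that a DR line carries the data bank name as its first field, terminated by a semicolon (for example 'DR   PIR; A00001; ...'); the function name and the sample lines in the usage note are hypothetical.

```python
# Counts copied from the statistics above.
PUBLISHED = {"PIR only": 3385, "EMBL only": 2084, "both": 2057, "none": 198}


def pointer_category(entry_lines):
    """Classify one entry by the data banks named on its DR lines.

    Assumption: the data bank name is the first ';'-terminated field
    after the 'DR' line code. Only PIR and EMBL are considered, so an
    entry without a pointer to either is reported as 'none'.
    """
    banks = set()
    for line in entry_lines:
        if line.startswith("DR"):
            first_field = line[2:].split(";")[0]
            banks.add(first_field.strip().upper())
    has_pir = "PIR" in banks
    has_embl = "EMBL" in banks
    if has_pir and has_embl:
        return "both"
    if has_pir:
        return "PIR only"
    if has_embl:
        return "EMBL only"
    return "none"


# The four categories account for every entry in release 8.0.
print(sum(PUBLISHED.values()))
```

For example, pointer_category(["DR   PIR; A00001; XXXX."]) returns "PIR only" (the identifiers are invented for the example).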
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE
RELEASE 7
2.1 Sequences and annotations
Some 903 new sequences have been added since the last
release; the sequence data of 93 existing entries has been
updated and the annotations of almost 2400 entries have
been revised. In particular, we have used review articles
to update the annotations of the following groups or
families of proteins:
Fructose-1,6-bisphosphate aldolases.
Alkaline phosphatases.
Annexins.
Endo- and Exo-glucanases.
Bacterial ferredoxins.
Beta-adrenergic receptors.
Chemotaxis proteins.
Collagens.
Complement proteins.
Cytochrome B5.
Cytosolic lipid-binding proteins.
Fumarases.
Malate dehydrogenases.
Muscarinic acetylcholine receptors.
Opsins.
Protamines.
PTS system proteins.
RNA polymerase sigma factors.
Serpins.
Small, acid-soluble spore proteins (SASP).
Tachykinins.
TGF-beta/Inhibins/MIS.
Zinc-finger proteins.
Zinc metallo endopeptidases.
2.2 New feature table keys
Two new feature keys have been introduced with this
release:
ZN_FING:  Extent of a "zinc-finger" region.
NP_BIND:  Extent of a nucleotide phosphate binding region.
          The nature of the nucleotide phosphate is indicated
          in the description field. Examples of NP_BIND
          feature lines:
          FT NP_BIND 13 25 ATP.
          FT NP_BIND 45 49 GTP (POTENTIAL).
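As an illustration of how such feature lines might be read by a program, here is a minimal Python sketch. It assumes that the fields are separated by whitespace, as in the examples above; the Feature type and the function parse_ft_line are hypothetical names and are not part of any SWISS-PROT software.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    key: str          # feature key, e.g. NP_BIND or ZN_FING
    start: int        # first residue of the region
    end: int          # last residue of the region
    description: str  # free text, e.g. 'ATP.'


def parse_ft_line(line):
    """Parse one feature line such as 'FT NP_BIND 13 25 ATP.'.

    Assumption: whitespace separates the line code, the key and the two
    positions; whatever follows is the description field.
    """
    parts = line.split(None, 4)
    if len(parts) < 4 or parts[0] != "FT":
        raise ValueError("not a feature line: %r" % line)
    description = parts[4].strip() if len(parts) == 5 else ""
    return Feature(parts[1], int(parts[2]), int(parts[3]), description)


feature = parse_ft_line("FT NP_BIND 45 49 GTP (POTENTIAL).")
print(feature.key, feature.end - feature.start + 1, feature.description)
```

The length of the region is end - start + 1, that is 5 residues for the second example.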
2.3 Enhancement in the format of the CC line
A major proportion of the comment blocks are now arranged
according to what we designate as 'topics'. The format of a
comment block which belongs to a 'topic' is:
CC -!- TOPIC: FREE TEXT DESCRIPTION.
The current topics and their definitions are:
CATALYTIC ACTIVITY    Description of the reaction(s) catalysed
                      by an enzyme [*].
DISEASE               Description of the disease(s) associated
                      with a deficiency of a protein.
FUNCTION              General description of the function(s)
                      of a protein.
INDUCTION             Description of the compound(s) which
                      stimulate the synthesis of a protein.
PATHWAY               Description of the metabolic pathway(s)
                      with which a protein is associated.
SIMILARITY            Description of the similarity(ies)
                      (sequence or structural) of a protein
                      with other proteins.
SUBCELLULAR LOCATION  Description of the subcellular
                      location of a mature protein product.
SUBUNIT               Description of the quaternary structure
                      of a protein.
[*] Whenever possible, we have described the catalytic
activity of an enzyme following the recommendations of
the Nomenclature Committee of the International Union of
Biochemistry (IUB) as published in:
Enzyme Nomenclature, NC-IUB, Academic Press, New York,
(1984).
and
Supplement 1: Corrections and Additions, Eur. J.
Biochem. 157:1-26(1986).
Here is an example of usage for each of the topics defined
above:
CC -!- CATALYTIC ACTIVITY: ATP + L-GLUTAMATE + NH(3) =
CC ADP + GLUTAMINE + ORTHOPHOSPHATE.
CC -!- DISEASE: NADH-CYTOCHROME B5 REDUCTASE DEFICIENCY
CC CAUSES HEREDITARY METHEMOGLOBINEMIA.
CC -!- FUNCTION: PREVENTS THE POLYMERIZATION OF ACTIN.
CC -!- INDUCTION: ARGINASE ACTIVITY CAN BE INDUCED BY
CC ARGININE OR HOMOARGININE.
CC -!- PATHWAY: FIRST STEP IN PROLINE BIOSYNTHESIS.
CC -!- SIMILARITY: WITH SUBTILISINS, THERMITASE, AND
CC PROTEINASE K.
CC -!- SUBCELLULAR LOCATION: MITOCHONDRIAL MATRIX.
CC -!- SUBUNIT: TETRAMER OF IDENTICAL CHAINS.
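A program reading these comment blocks could proceed along the following lines. This Python sketch is only an illustration: it assumes that the marker '-!-' appears only at the start of a block and that any other CC line continues the current block. The names TOPICS and parse_cc_topics are hypothetical.

```python
# The eight topics defined in section 2.3.
TOPICS = {
    "CATALYTIC ACTIVITY", "DISEASE", "FUNCTION", "INDUCTION",
    "PATHWAY", "SIMILARITY", "SUBCELLULAR LOCATION", "SUBUNIT",
}


def parse_cc_topics(lines):
    """Return a list of (topic, text) pairs from the CC lines of an entry.

    A block without a recognised topic is returned with an empty topic.
    Continuation lines are joined to the current block with one space.
    """
    blocks = []
    for line in lines:
        if not line.startswith("CC"):
            continue
        body = line[2:].strip()
        if body.startswith("-!-"):
            content = body[3:].strip()
            head, colon, text = content.partition(":")
            if colon and head.strip() in TOPICS:
                blocks.append([head.strip(), text.strip()])
            else:
                blocks.append(["", content])
        elif blocks:
            blocks[-1][1] = (blocks[-1][1] + " " + body).strip()
    return [(topic, text) for topic, text in blocks]


example = [
    "CC -!- CATALYTIC ACTIVITY: ATP + L-GLUTAMATE + NH(3) =",
    "CC ADP + GLUTAMINE + ORTHOPHOSPHATE.",
    "CC -!- SUBUNIT: TETRAMER OF IDENTICAL CHAINS.",
]
for topic, text in parse_cc_topics(example):
    print(topic, "|", text)
```

Restricting the match to the known topics avoids mistaking a colon in a free-text comment for a topic separator; the set would need to be extended if further topics are defined.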
2.4 Small change in the format of the RN line
The syntax of RN lines for articles which report the
sequence of a protein translated from a nucleotide sequence
used to be "SEQUENCE TRANSL. FROM N.A. SEQ.", and has now
been changed to "SEQUENCE FROM N.A."
Examples:
RN [1] (SEQUENCE FROM N.A.)
RN [1] (SEQUENCE OF 12-143 FROM N.A.)
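Users who keep local copies of older entries may wish to convert the old syntax to the new one. The Python sketch below shows one way of doing so. It assumes that, in the old syntax, a residue range was written as 'SEQUENCE OF 12-143 TRANSL. FROM N.A. SEQ.'; this ranged form is our assumption, as only the unranged old form is quoted above. The function name update_rn_line is hypothetical.

```python
import re

# Old wording, optionally with a residue range (the ranged old form is
# an assumption; see the text above).
_OLD_SYNTAX = re.compile(
    r"SEQUENCE( OF \d+-\d+)? TRANSL\. FROM N\.A\. SEQ\."
)


def update_rn_line(line):
    """Rewrite the old 'TRANSL. FROM N.A. SEQ.' wording to 'FROM N.A.'.

    Lines that do not contain the old wording are returned unchanged.
    """
    def replace(match):
        residue_range = match.group(1) or ""
        return "SEQUENCE" + residue_range + " FROM N.A."

    return _OLD_SYNTAX.sub(replace, line)


print(update_rn_line("RN [1] (SEQUENCE TRANSL. FROM N.A. SEQ.)"))
```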
2.5 Changes in the documentation
The document file SPECFREQ.TXT, which listed the frequency
of usage of the species identification codes, has been
discontinued. This information is now included in a new
appendix of these release notes (Appendix B), along with
other statistics such as the amino acid composition of the
data bank and the distribution of the sequences by size
(number of residues).
3. THE NEXT RELEASE
SWISS-PROT release 9.0 will be available in December 1988.
We plan to upgrade the OC lines so as to add at least a few
levels of taxonomic nodes (currently only the top node,
viridae, prokaryota or eukaryota, is available).
4. WE NEED YOUR HELP!
We welcome any feedback from our users. We would especially
appreciate being notified if you find that sequences
belonging to your field of expertise are missing from the
data bank. We would also like to be notified about
annotations that need updating, for example if the function
of a protein has been clarified or if new
post-translational information has become available.
APPENDIX A: DISKS FOR SWISS-PROT
A.1 IBM PC/AT 1.2 Mb disks
SWISS-PROT is stored on nine 1.2 Mb disks. Each of these
disks contains a single bulk file (PROT8_01.BLK to
PROT8_09.BLK):
Disk   First sequence   Last sequence
   1   16K$TRVPS        CD8$HUMAN
   2   CD8$MOUSE        DYR$LACCA
   3   DYR$MESAU        HBA$CAVPO
   4   HBA$CEBAP        KNL1$BOVIN
   5   KNL1$HUMAN       NRAM$INATO
   6   NRAM$INATR       RASN$MOUSE
   7   RB1$HUMAN        TRA$MAIZE
   8   TRA$PSEAE        Y8K6$VACCV
   9   Y90K$VACCV       ZEG1$MAIZE
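To find the bulk file that would hold a given entry, the table can be searched on the last entry name of each disk. The Python sketch below assumes that the entries are sorted in plain ASCII order, which is consistent with the boundaries listed above but is not stated in these notes. The function name bulk_file_for and the query names are only examples.

```python
from bisect import bisect_left

# Last entry name on each of the nine 1.2 Mb disks (section A.1).
LAST_ON_DISK = [
    "CD8$HUMAN", "DYR$LACCA", "HBA$CAVPO", "KNL1$BOVIN", "NRAM$INATO",
    "RASN$MOUSE", "TRA$MAIZE", "Y8K6$VACCV", "ZEG1$MAIZE",
]


def bulk_file_for(entry_name):
    """Return the bulk file name whose name range covers entry_name.

    Assumption: entry names are ordered by their ASCII codes. The
    function does not check that the entry actually exists.
    """
    index = bisect_left(LAST_ON_DISK, entry_name)
    if index == len(LAST_ON_DISK):
        raise ValueError("%s sorts after the last entry" % entry_name)
    return "PROT8_%02d.BLK" % (index + 1)


print(bulk_file_for("RB1$HUMAN"))
```

The same approach applies to the 360 Kb disks, using the last entry names of the twenty-eight bulk files.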
A.2 IBM PS/2 1.4 Mb disks
The number and content of the 1.4 Mb disks for the PS/2
systems are identical to those of the 1.2 Mb disks (see
above).
A.3 IBM PC 360 Kb disks
SWISS-PROT is stored on twenty-eight (28) 360 Kb disks.
Each of these disks contains a single bulk file
(PROT8_01.BLK to PROT8_28.BLK):
Disk   First sequence   Last sequence
   1   16K$TRVPS        AMEG$BOVIN
   2   AMID$XENLA       AZUR$ALCDE
   3   AZUR$ALCFA       CART$CHICK
   4   CAS1$BOVIN       COA1$POVMA
   5   COA1$SV40        CRG2$RANTE
   6   CRG2$RAT         CYS1$DICDI
   7   CYS2$DICDI       ELA2$MOUSE
   8   ELA2$PIG         FIBG$RAT
   9   FIBH$HUMAN       GLEM$HUMAN
  10   GLGA$ECOLI       HB2I$HUMAN
  11   HB2I$MOUSE       HEMA$MEASH
  12   HEMA$NDV         IG1R$HUMAN
  13   IGAO$DACDE       KAG2$MOUSE
  14   KAG3$MOUSE       KV5C$MOUSE
  15   KV5D$MOUSE       MAL3$DROME
  16   MALE$ECOLI       MYSB$RABIT
  17   MYSP$RAT         NUO5$PANPA
  18   NUO5$PONPY       PG40$HUMAN
  19   PGCA$CHICK       POLN$SFV
  20   POLN$SINDV       RADX$YEAST
  21   RAN1$SCHPO       ROB2$HUMAN
  22   ROC$HUMAN        SP0A$BACSU
  23   SP0B$BACSU       THRB$BOVIN
  24   THRB$HUMAN       TX1$ANESU
  25   TX11$NAJHH       VGE$BPS13
  26   VGF$BPG4         VNST$VSVJ
  27   VNST$VSVJM       YGYW$ECOLI
  28   YHL1$EPBAR       ZEG1$MAIZE
A.4 IBM PS/2 720 Kb disks
SWISS-PROT is stored on fourteen (14) 720 Kb disks. Each of
these disks contains two of the 360 Kb format bulk files.
APPENDIX B: SOME STATISTICS
B.1 Amino acid composition
Composition in percent for the complete data bank
Ala (A) 7.76   Gln (Q) 4.11   Leu (L) 9.07   Ser (S) 7.04
Arg (R) 5.12   Glu (E) 6.12   Lys (K) 5.88   Thr (T) 5.86
Asn (N) 4.36   Gly (G) 7.33   Met (M) 2.28   Trp (W) 1.36
Asp (D) 5.21   His (H) 2.30   Phe (F) 3.95   Tyr (Y) 3.25
Cys (C) 1.93   Ile (I) 5.28   Pro (P) 5.21   Val (V) 6.52
Asx (B) 0.02   Glx (Z) 0.02   Xaa (X) 0.01
Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Lys, Thr, Ile, Pro, Asp, Arg,
Asn, Gln, Phe, Tyr, His, Met, Cys, Trp
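The classification above follows directly from the composition table, as the Python sketch below illustrates. The percentages are copied from the table; the names COMPOSITION, AMBIGUOUS and rank_by_frequency are our own and purely illustrative. Note that Pro and Asp are tied at 5.21, so their relative order depends on the sorting method used.

```python
# Composition in percent, copied from the table in section B.1.
COMPOSITION = {
    "Ala": 7.76, "Arg": 5.12, "Asn": 4.36, "Asp": 5.21, "Cys": 1.93,
    "Gln": 4.11, "Glu": 6.12, "Gly": 7.33, "His": 2.30, "Ile": 5.28,
    "Leu": 9.07, "Lys": 5.88, "Met": 2.28, "Phe": 3.95, "Pro": 5.21,
    "Ser": 7.04, "Thr": 5.86, "Trp": 1.36, "Tyr": 3.25, "Val": 6.52,
}
# Ambiguous and unknown residues, listed separately in the table.
AMBIGUOUS = {"Asx": 0.02, "Glx": 0.02, "Xaa": 0.01}


def rank_by_frequency(composition):
    """Return the residue names sorted from most to least frequent.

    Python's sort is stable, so tied residues keep their input order.
    """
    return sorted(composition, key=lambda name: composition[name],
                  reverse=True)


ranking = rank_by_frequency(COMPOSITION)
total = sum(COMPOSITION.values()) + sum(AMBIGUOUS.values())
print(", ".join(ranking))
print("total: %.2f" % total)
```

The percentages add up to 99.99, the remainder being due to rounding.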
B.2 Distribution of the sequences by their organism of
origin
Total number of species represented in this release of the
data bank: 1345
Species represented   1x: 636
                      2x: 259
                      3x: 140
                      4x:  88
                      5x:  40
                      6x:  39
                      7x:  25
                      8x:  23
                      9x:  15
                     10x:   8
                    >10x:  72
Table of the most represented species
739: HUMAN    133: CHICK     62: BACSU     55: TOBAC
628: ECOLI    125: RABIT         LAMBD     54: BPT7
414: MOUSE    117: DROME     59: EPBAR     52: MARPO
312: RAT      113: PIG           MAIZE
256: YEAST     68: XENLA     58: SALTY
250: BOVIN     66: BPT4          VACCV
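Statistics of this kind can be derived from a list of entry names. The Python sketch below assumes that the part of an entry name after the '$' character is the species identification code, as the names and codes quoted in these notes suggest; the function names are hypothetical and this is not the program used to produce the figures above.

```python
from collections import Counter

# Counts copied from the statistics above: how many species are
# represented once, twice, and so on (">10" groups the rest).
PUBLISHED = {"1": 636, "2": 259, "3": 140, "4": 88, "5": 40, "6": 39,
             "7": 25, "8": 23, "9": 15, "10": 8, ">10": 72}


def species_code(entry_name):
    """Return the species code of an entry name such as 'CD8$HUMAN'.

    Assumption: the code is the text following the '$' character.
    """
    _protein, dollar, species = entry_name.partition("$")
    if not dollar or not species:
        raise ValueError("unexpected entry name: %r" % entry_name)
    return species


def species_statistics(entry_names):
    """Count entries per species, then species per number of entries."""
    per_species = Counter(species_code(name) for name in entry_names)
    histogram = Counter(per_species.values())
    return per_species, histogram


names = ["CD8$HUMAN", "KNL1$HUMAN", "RB1$HUMAN",
         "CD8$MOUSE", "RASN$MOUSE", "HBA$CAVPO"]
per_species, histogram = species_statistics(names)
print(per_species.most_common())
print(sum(PUBLISHED.values()))
```

The published counts add up to the 1345 species quoted above.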
B.3 Distribution of the sequences by size
From   To   Number       From   To   Number
   1-  50      384       1001-1100       36
  51- 100     1000       1101-1200       30
 101- 150     1743       1201-1300       26
 151- 200      856       1301-1400       18
 201- 250      581       1401-1500       12
 251- 300      494       1501-1600        2
 301- 350      450       1601-1700       10
 351- 400      432       1701-1800        7
 401- 450      309       1801-1900        5
 451- 500      350       1901-2000        2
 501- 550      284           >2000       42
 551- 600      166
 601- 650      123       Currently the two largest sequences are:
 651- 700       74       APB$HUMAN    4563 a.a.
 701- 750       76       APOA$HUMAN   4548 a.a.
 751- 800       55
 801- 850       46
 851- 900       56
 901- 950       36
 951-1000       22
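The size classes used in this table are 50 residues wide up to 1000 residues and 100 residues wide from 1001 to 2000, with a single class for longer sequences. The Python sketch below assigns a sequence length to its class; the function name size_class is our own and the code is only an illustration of the binning.

```python
def size_class(length):
    """Return the size class of section B.3 for a sequence length.

    Classes are 50 residues wide up to 1000, 100 residues wide from
    1001 to 2000, and '>2000' beyond that.
    """
    if length < 1:
        raise ValueError("a sequence has at least one residue")
    if length > 2000:
        return ">2000"
    width = 50 if length <= 1000 else 100
    # Round the length up to the next multiple of the class width.
    upper = ((length + width - 1) // width) * width
    lower = upper - width + 1
    return "%d-%d" % (lower, upper)


for length in (50, 51, 1000, 1001, 4563):
    print(length, size_class(length))
```

Both of the largest sequences quoted above, at 4563 and 4548 residues, fall in the '>2000' class.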