EMBOSS: scope
Program scope
Function
Convert raw scop classification file to embl-like format
Description
Nearly all proteins have structural similarities with other proteins
and, in some of these cases, share a common evolutionary origin. A
knowledge of these relationships is crucial to our understanding of the
evolution of proteins and of development. It will also play an
important role in the analysis of the sequence data that is being
produced by worldwide genome projects.
The SCOP database aims to provide a detailed and comprehensive
description of the structural and evolutionary relationships between all
proteins whose structure is known, including all entries in the Protein
Data Bank (PDB).
scope reads the SCOP classification file available at
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?dir=lin
scope writes the SCOP classification to an EMBL-like format file.
Only the format in which the data is held changes; the data itself is not altered.
This EMBL-like format SCOP file is used by several other EMBOSS programs.
The SCOP database is converted to an EMBL-like format before being used
by other EMBOSS programs because that format is easier to work with than
the native SCOP database format.
Usage
Here is a sample session with scope:
% scope
Convert raw scop classification file to embl-like format
Name of scop file for input (raw format) [scop.orig]: /data/scop/scop.orig
Name of scop file for output (embl-like format) [Escop.dat]: Escop.test
Command line arguments
Mandatory qualifiers:
  [-infile]   infile   Name of scop file for input (raw format)
  [-outfile]  outfile  Name of scop file for output (embl-like format)
Optional qualifiers: (none)
Advanced qualifiers: (none)
General qualifiers:
  -help       bool     report command line options. More information on
                       associated and general qualifiers can be found
                       with -help -verbose
Qualifier                | Description                                     | Allowed values | Default
Mandatory qualifiers
[-infile] (Parameter 1)  | Name of scop file for input (raw format)        | Input file     | scop.orig
[-outfile] (Parameter 2) | Name of scop file for output (embl-like format) | Output file    | Escop.dat
Optional qualifiers
(none)
Advanced qualifiers
(none)
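The sample session above is interactive. Because both qualifiers can be given on the command line, the same run can be scripted. The following is a minimal Python sketch, not part of EMBOSS: the helper name scope_command is hypothetical, and it assumes the scope executable is on the PATH when the command is actually run.

```python
import shlex


def scope_command(infile="scop.orig", outfile="Escop.dat"):
    """Build the argument list for a non-interactive scope run.

    The defaults mirror the defaults listed in the qualifier table.
    """
    return ["scope", "-infile", infile, "-outfile", outfile]


cmd = scope_command("/data/scop/scop.orig", "Escop.test")
print(shlex.join(cmd))

# To run it (assumes scope is installed and on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when a file name contains spaces.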
Input file format
The native format SCOP database input file is available at
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?dir=lin
The format of this file is explained at
http://scop.mrc-lmb.cam.ac.uk/scop/parindex.html
The input file contains a single line for each domain in SCOP,
including text describing the position of the domain in the SCOP hierarchy.
Note that other SCOP classification files, without this annotation, are
available at
http://scop.mrc-lmb.cam.ac.uk/scop/parindex.html
Output file format
The records used to describe an entry in the output file are given below.
The CL, FO, SF, FA and DO records (the fourth to eighth in the list)
describe the position of the domain in the scop hierarchy.
- ID - Domain identifier code. This is a 7-character code that uniquely
identifies the domain in scop. It is identical to the first 7 characters
of a line in the scop classification file. The first character is always
'D', the next four characters are the PDB identifier code, the sixth
character is the PDB chain identifier to which the domain belongs (a '.' is
given where the domain is composed of multiple chains, a '_' where a
chain identifier was not specified in the PDB file) and the final character
is the number of the domain in the chain (for chains comprising more than
one domain) or '_' (where the chain comprises a single domain only).
- EN - PDB identifier code. This is the 4-character PDB identifier code
of the PDB entry containing the domain.
- OS - Source of the protein. It is identical to the text given after
'Species' in the scop classification file.
- CL - Domain class. It is identical to the text given after 'Class' in
the scop classification file.
- FO - Domain fold. It is identical to the text given after 'Fold' in
the scop classification file.
- SF - Domain superfamily. It is identical to the text given after
'Superfamily' in the scop classification file.
- FA - Domain family. It is identical to the text given after 'Family' in
the scop classification file.
- DO - Domain name. It is identical to the text given after 'Protein' in
the scop classification file.
- NC - Number of chains comprising the domain (usually 1). If the number
of chains is greater than 1, then the domain entry will have a section
containing a CN and a CH record (see below) for each chain.
- CN - Chain number. The number given in brackets after this record
indicates the start of the data for the relevant chain.
- CH - Domain definition. The character given before CHAIN is the PDB
chain identifier (a '.' is given where a chain identifier was not
specified in the scop classification file). The strings before START and
END give the start and end positions, respectively, of the domain in the PDB
file (a '.' is given where a position was not specified). Note that the
start and end positions use the residue numbering of the original PDB file,
which is not guaranteed to be a plain integer sequence (it can include
insertion codes), and must therefore be treated as strings.
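The layout of the ID record can be illustrated with a short Python sketch that splits a code such as D3SDHA_ into its parts. This is an illustration only and not code from EMBOSS; the names DomainId and split_domain_id are hypothetical.

```python
from typing import NamedTuple


class DomainId(NamedTuple):
    pdb_code: str  # 4-character PDB identifier (same value as the EN record)
    chain: str     # chain identifier; '.' = multiple chains, '_' = none given
    domain: str    # domain number in the chain; '_' = single-domain chain


def split_domain_id(code: str) -> DomainId:
    """Split a 7-character domain identifier into its three fields."""
    if len(code) != 7 or code[0] != "D":
        raise ValueError(f"not a 7-character domain identifier: {code!r}")
    return DomainId(pdb_code=code[1:5], chain=code[5], domain=code[6])


print(split_domain_id("D3SDHA_"))
```

All three fields are kept as strings, since the chain and domain positions can hold the placeholder characters '.' and '_' as well as real identifiers.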
An excerpt from an output file follows:
ID D3SDHA_
XX
EN 3SDH
XX
OS Ark clam (Scapharca inaequivalvis)
XX
CL All alpha proteins
XX
FO Globin-like
XX
SF Globin-like
XX
FA Globins
XX
DO Hemoglobin I
XX
NC 1
XX
CN [1]
XX
CH a CHAIN; . START; . END;
//
ID D3SDHB_
XX
EN 3SDH
XX
OS Ark clam (Scapharca inaequivalvis)
XX
CL All alpha proteins
XX
FO Globin-like
XX
SF Globin-like
XX
FA Globins
XX
DO Hemoglobin I
XX
NC 1
XX
CN [1]
XX
CH b CHAIN; . START; . END;
//
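A file in this format can be read back with very little code. The following Python sketch is one possible reader and is not how other EMBOSS programs parse the file; the function name parse_entries is hypothetical. It assumes that each record occupies a single line, that the two-letter code is separated from its value by whitespace, and that '//' ends an entry, as in the excerpt above.

```python
SAMPLE = """\
ID D3SDHA_
XX
EN 3SDH
XX
OS Ark clam (Scapharca inaequivalvis)
XX
CL All alpha proteins
XX
FO Globin-like
XX
SF Globin-like
XX
FA Globins
XX
DO Hemoglobin I
XX
NC 1
XX
CN [1]
XX
CH a CHAIN; . START; . END;
//
"""


def parse_entries(lines):
    """Yield one dict per entry, keyed by the two-letter record code."""
    entry = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):      # end-of-entry marker
            if entry:
                yield entry
            entry = {}
            continue
        code, _, value = line.partition(" ")
        value = value.strip()
        if not code or code in ("XX", "CN"):
            continue                   # blank, spacer or chain-number line
        if code == "CH":
            # 'a CHAIN; . START; . END;' -> ('a', '.', '.'), all strings
            parts = [f.split()[0] for f in value.split(";") if f.strip()]
            entry.setdefault("CH", []).append(tuple(parts))
        else:
            entry[code] = value
    if entry:                          # tolerate a missing final '//'
        yield entry


entries = list(parse_entries(SAMPLE.splitlines()))
for e in entries:
    print(e["ID"], e["FA"], e["CH"])
```

CH records are collected in a list in file order, so for a multi-chain domain the list index corresponds to the CN chain number minus one. Start and end positions are deliberately left as strings.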
Data files
None.
Notes
None.
References
None.
Warnings
None.
Diagnostic Error Messages
None.
Exit status
It always exits with status 0.
Known bugs
None.
Author(s)
This application was written by Jon Ison (jison@hgmp.mrc.ac.uk)
History
Written (Jan 2001) - Jon Ison.
Target users
This program is intended to be run by EMBOSS site maintainers or those
responsible for setting up and maintaining protein 3D structural data
for use by others.
Comments