EMBOSS: domainer
The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the Protein Data Bank (PDB).
domainer reads an EMBL-like format SCOP classification file generated by the EMBOSS applications scope or nrscope, and EMBL-like format clean protein coordinate files generated by the coorde application (not currently in EMBOSS; email Jon Ison, jison@hgmp.mrc.ac.uk). For each domain in the SCOP classification file, domainer writes clean domain coordinate files in EMBL-like and PDB formats. Each output file contains coordinates for a single SCOP domain. Where multiple models were determined, the data in the domain files correspond to the first model. In the rare cases where a domain comprises more than one chain, the data are presented as belonging to a single chain (i.e. a single sequence, a single chain identifier, etc. are given).
    % domainer
    Build domain coordinate files
    Name of scop file for input (embl-like format) [Escop.dat]: /data/scop/Escop.dat
    Location of coordinate files for input (embl-like format) [./]: /data/cpdb/
    Location of coordinate files for output (embl-like format) [./]:
    Extension of coordinate files (embl-like format) [.pxyz]:
    Location of coordinate files for output (pdb format) [./]:
    Extension of coordinate files (pdb format) [.ent]:
    Name of log file for the embl-like format build [domainer.log1]: log.1
    Name of log file for the pdb format build [domainer.log2]: log.2
    D3SDHA_
    D3SDHB_
    D3HBIA_
    D3HBIB_
    D4SDHA_
    D4SDHB_
    D4HBIA_
    D4HBIB_
    D5HBIA_
    D5HBIB_
    D7HBIA_
    D7HBIB_
       Mandatory qualifiers:
      [-scop]              infile     Name of scop file for input (embl-like format)
      [-cpdb]              string     Location of coordinate files for input (embl-like format)
      [-cpdbscop]          string     Location of coordinate files for output (embl-like format)
      [-cpdbextn]          string     Extension of coordinate files (embl-like format)
      [-pdbscop]           string     Location of coordinate files for output (pdb format)
      [-pdbextn]           string     Extension of coordinate files (pdb format)
      [-cpdberrf]          outfile    Name of log file for the embl-like format build
      [-pdberrf]           outfile    Name of log file for the pdb format build

       Optional qualifiers: (none)
       Advanced qualifiers: (none)
       General qualifiers:
      -help                bool       report command line options. More information on associated and general qualifiers can be found with -help -verbose
Mandatory qualifiers | Description | Allowed values | Default |
---|---|---|---|
[-scop] (Parameter 1) | Name of scop file for input (embl-like format) | Input file | Escop.dat |
[-cpdb] (Parameter 2) | Location of coordinate files for input (embl-like format) | Any string is accepted | ./ |
[-cpdbscop] (Parameter 3) | Location of coordinate files for output (embl-like format) | Any string is accepted | ./ |
[-cpdbextn] (Parameter 4) | Extension of coordinate files (embl-like format) | Any string is accepted | .pxyz |
[-pdbscop] (Parameter 5) | Location of coordinate files for output (pdb format) | Any string is accepted | ./ |
[-pdbextn] (Parameter 6) | Extension of coordinate files (pdb format) | Any string is accepted | .ent |
[-cpdberrf] (Parameter 7) | Name of log file for the embl-like format build | Output file | domainer.log1 |
[-pdberrf] (Parameter 8) | Name of log file for the pdb format build | Output file | domainer.log2 |
Optional qualifiers | (none) | | |
Advanced qualifiers | (none) | | |
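The qualifiers above can also be supplied on the command line rather than at the prompts. The following Python sketch only assembles such a command from the documented qualifier names and defaults; the helper name `build_domainer_command` is hypothetical, and the sketch assumes domainer is installed and accepts the usual EMBOSS `-qualifier value` form.

```python
import shlex

def build_domainer_command(scop="Escop.dat", cpdb="./", cpdbscop="./",
                           cpdbextn=".pxyz", pdbscop="./", pdbextn=".ent",
                           cpdberrf="domainer.log1", pdberrf="domainer.log2"):
    """Return an argument list for domainer (hypothetical helper).

    Parameter names and defaults mirror the qualifier table above.
    """
    return [
        "domainer",
        "-scop", scop,          # SCOP classification file (embl-like format)
        "-cpdb", cpdb,          # location of input coordinate files
        "-cpdbscop", cpdbscop,  # location of embl-like output files
        "-cpdbextn", cpdbextn,  # extension of embl-like coordinate files
        "-pdbscop", pdbscop,    # location of pdb-format output files
        "-pdbextn", pdbextn,    # extension of pdb-format coordinate files
        "-cpdberrf", cpdberrf,  # log file for the embl-like format build
        "-pdberrf", pdberrf,    # log file for the pdb format build
    ]

# Same values as the interactive session shown earlier.
cmd = build_domainer_command(scop="/data/scop/Escop.dat", cpdb="/data/cpdb/",
                             cpdberrf="log.1", pdberrf="log.2")
print(shlex.join(cmd))
# To actually run it (assumes domainer is on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when directory names contain spaces.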
(1) ID - Either the 4-character PDB identifier code (for clean protein coordinate files) or the 7-character domain identifier code taken from scop (for domain coordinate files; see documentation for the EMBOSS application scope for further info.)
(2) DE - compound information. Text from the COMPND records of the original PDB file is given.
(3) OS - protein source information. Text from the SOURCE records of the original PDB file is given.
(4) EX - experimental information. The text 'nmr_or_model' (for nuclear magnetic resonance and model structures) or 'xray' (for structures determined by X-ray crystallography) appears as appropriate after the text 'METHOD'. The resolution of X-ray structures, or '0' for structures of type 'nmr_or_model', is given after 'RESO'. The number of models and the number of polypeptide chains are given after 'NMOD' and 'NCHA' respectively; for domain coordinate files a 1 is always given for both. Following the EX record, the file has a section containing CN, IN and SQ records (see below) for each chain.
(5) CN - chain number. The number given in brackets after this record indicates the start of a section of chain-specific data.
(6) IN - chain-specific data. The character given after ID is the PDB chain identifier (a '.' is given where no chain identifier was specified in the PDB file or, for domain coordinate files, where the domain comprises more than one chain). The number of amino acid residues in the chain (or in the chains that make up a domain) is given after NR. The numbers of atoms in heterogens and in water molecules are given after NH and NW respectively. Domain coordinate files do not include coordinates for these groups, so a value of 0 is always given.
(7) SQ - protein sequence. The number of residues is given before AA on the first line. The protein sequence is given on subsequent lines.
(8) CO - coordinate data. The columns of the records are as follows.
(9) XX - Used for spacing.
(10) // - Given on the last line of the file only.
(1) HEADER - bibliographic information. The text 'CLEANED-UP PDB FILE FOR SCOP DOMAIN XXXXXXX' is always given (where XXXXXXX is a 7-character domain identifier code).
(2) TITLE - bibliographic information. The text ' THIS FILE IS MISSING MOST RECORDS FROM THE ORIGINAL PDB FILE' is always given.
(3) COMPND - compound information. The COMPND records from the original pdb file are given.
(4) SOURCE - protein source information. The SOURCE records from the original PDB file are given.
(5) REMARK - remark records. Remark records are used for spacing. One REMARK line containing the protein resolution is always given.
(6) SEQRES - protein sequence.
(7) ATOM - atomic coordinates.
(8) TER - indicates the end of a chain.
The following is an example of an excerpt from an output clean domain coordinate file (PDB format):
    HEADER    CLEANED-UP PDB FILE FOR SCOP DOMAIN D1HBBA_
    TITLE     THIS FILE IS MISSING MOST RECORDS FROM THE ORIGINAL PDB FILE
    COMPND    HEMOGLOBIN A (DEOXY, LOW SALT, 100MM CL)
    SOURCE    HUMAN (HOMO SAPIENS)
    REMARK
    REMARK RESOLUTION. 1.90 ANGSTROMS.
    REMARK
    SEQRES   1 A  141  VAL LEU SER PRO ALA ASP LYS THR ASN VAL LYS ALA ALA
    SEQRES   2 A  141  TRP GLY LYS VAL GLY ALA HIS ALA GLY GLU TYR GLY ALA
    SEQRES   3 A  141  GLU ALA LEU GLU ARG MET PHE LEU SER PHE PRO THR THR
    SEQRES   4 A  141  LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER
    SEQRES   5 A  141  ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA
    SEQRES   6 A  141  LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN
    SEQRES   7 A  141  ALA LEU SER ALA LEU SER ASP LEU HIS ALA HIS LYS LEU
    SEQRES   8 A  141  ARG VAL ASP PRO VAL ASN PHE LYS LEU LEU SER HIS CYS
    SEQRES   9 A  141  LEU LEU VAL THR LEU ALA ALA HIS LEU PRO ALA GLU PHE
    SEQRES  10 A  141  THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU ALA
    SEQRES  11 A  141  SER VAL SER THR VAL LEU THR SER LYS TYR ARG
    ATOM      1  N   VAL A   1       7.155  17.725   4.424  1.00 37.82           N
    ATOM      2  CA  VAL A   1       7.854  18.800   3.718  1.00 35.10           C
    ATOM      3  C   VAL A   1       9.366  18.565   3.754  1.00 31.92           C
    ATOM      4  O   VAL A   1       9.861  17.961   4.721  1.00 35.01           O
    ATOM      5  CB  VAL A   1       7.529  20.168   4.360  1.00 47.63           C
    ATOM      6  CG1 VAL A   1       7.806  21.300   3.369  1.00 62.84           C
    ATOM      7  CG2 VAL A   1       6.136  20.244   4.936  1.00 54.85           C
    ATOM      8  N   LEU A   2      10.032  19.062   2.731  1.00 27.38           N
    ATOM      9  CA  LEU A   2      11.496  18.967   2.657  1.00 23.24           C
    ATOM     10  C   LEU A   2      12.077  20.110   3.496  1.00 22.99           C
    ATOM     11  O   LEU A   2      11.672  21.259   3.289  1.00 25.22           O
    ATOM     12  CB  LEU A   2      11.924  19.005   1.204  1.00 18.04           C
    ATOM     13  CG  LEU A   2      11.563  17.855   0.286  1.00 17.80           C
    ATOM     14  CD1 LEU A   2      12.166  18.109  -1.097  1.00 20.08           C
    ATOM     15  CD2 LEU A   2      12.116  16.542   0.839  1.00 13.84           C
    ATOM     16  N   SER A   3      12.979  19.784   4.391  1.00 22.22           N
    ATOM     17  CA  SER A   3      13.652  20.792   5.257  1.00 20.53           C
    ATOM     18  C   SER A   3      14.871  21.318   4.505  1.00 18.31           C
    ATOM     19  O   SER A   3      15.273  20.709   3.496  1.00 17.73           O
    ATOM     20  CB  SER A   3      14.084  20.042   6.534  1.00 17.61           C
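The SEQRES records in the PDB-format excerpt carry the same sequence that the SQ record holds in one-letter form in the EMBL-like file. As a minimal sketch (not part of domainer), the following Python converts SEQRES lines to a one-letter string. The function name `seqres_to_sequence` is hypothetical, and the sketch assumes whitespace-separated records with a non-blank chain identifier, as in the excerpt.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def seqres_to_sequence(lines):
    """Concatenate the residues of SEQRES lines as one-letter codes."""
    residues = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] == "SEQRES":
            # tokens: SEQRES, line serial, chain id, residue count, residues...
            # Assumes the chain id is not blank (otherwise the offset shifts).
            residues.extend(tokens[4:])
    # Unrecognised residue names map to 'X' (a choice made for this sketch).
    return "".join(THREE_TO_ONE.get(name, "X") for name in residues)

# First two SEQRES lines of the D1HBBA_ excerpt.
sample = [
    "SEQRES   1 A  141  VAL LEU SER PRO ALA ASP LYS THR ASN VAL LYS ALA ALA",
    "SEQRES   2 A  141  TRP GLY LYS VAL GLY ALA HIS ALA GLY GLU TYR GLY ALA",
]
print(seqres_to_sequence(sample))
```

Comparing the result with the SQ record of the corresponding EMBL-like file is a quick consistency check between the two outputs for a domain.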
domainer also writes the clean domain coordinate files in EMBL-like format. This format is described in the Input File format section of this document, because the input clean protein files and the output clean domain files use the same EMBL-like format.
The following is an example of an excerpt from an output clean domain coordinate file (EMBL-like format):
    ID   D1HBBA_
    XX
    DE   Co-ordinates for SCOP domain D1HBBA_
    XX
    OS   See Escop.dat for domain classification
    XX
    EX   METHOD xray; RESO 1.90; NMOD 1; NCHA 1;
    XX
    CN   [1]
    XX
    IN   ID A; NR 141; NH 0; NW 0;
    XX
    SQ   SEQUENCE   141 AA;  15127 MW;  5EC7DB1E CRC32;
         VLSPADKTNV KAAWGKVGAH AGEYGAEALE RMFLSFPTTK TYFPHFDLSH GSAQVKGHGK
         KVADALTNAV AHVDDMPNAL SALSDLHAHK LRVDPVNFKL LSHCLLVTLA AHLPAEFTPA
         VHASLDKFLA SVSTVLTSKY R
    XX
    CO   1    1    P    1     1     V    VAL    N       7.155   17.725    4.424    1.00   37.82
    CO   1    1    P    1     1     V    VAL    CA      7.854   18.800    3.718    1.00   35.10
    CO   1    1    P    1     1     V    VAL    C       9.366   18.565    3.754    1.00   31.92
    CO   1    1    P    1     1     V    VAL    O       9.861   17.961    4.721    1.00   35.01
    CO   1    1    P    1     1     V    VAL    CB      7.529   20.168    4.360    1.00   47.63
    CO   1    1    P    1     1     V    VAL    CG1     7.806   21.300    3.369    1.00   62.84
    CO   1    1    P    1     1     V    VAL    CG2     6.136   20.244    4.936    1.00   54.85
    CO   1    1    P    2     2     L    LEU    N      10.032   19.062    2.731    1.00   27.38
    CO   1    1    P    2     2     L    LEU    CA     11.496   18.967    2.657    1.00   23.24
    CO   1    1    P    2     2     L    LEU    C      12.077   20.110    3.496    1.00   22.99
    CO   1    1    P    2     2     L    LEU    O      11.672   21.259    3.289    1.00   25.22
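For readers who want to process these files, here is a minimal Python sketch of reading the ID, EX, IN and CO records of an EMBL-like coordinate file. It is not the EMBOSS parser. The function names are hypothetical, and the CO handling assumes only what the excerpt shows: the last five values are treated as x, y, z and two further numbers, preceded by the atom name and the three-letter residue name (this matches the ATOM records of the PDB-format excerpt for the same domain). The meaning of the leading CO columns is not assumed.

```python
def parse_key_values(body):
    """Split 'KEY value; KEY value;' into a dict of strings."""
    fields = {}
    for item in body.split(";"):
        parts = item.split()
        if len(parts) == 2:
            fields[parts[0]] = parts[1]
    return fields

def parse_embl_like(text):
    """Collect ID, EX, IN and CO records from EMBL-like coordinate text."""
    info = {"id": None, "ex": {}, "chains": [], "atoms": []}
    for line in text.splitlines():
        code, _, body = line.partition(" ")
        body = body.strip()
        if code == "ID":
            info["id"] = body
        elif code == "EX":
            info["ex"] = parse_key_values(body)
        elif code == "IN":
            info["chains"].append(parse_key_values(body))
        elif code == "CO":
            cols = body.split()
            # Assumed layout of the trailing columns (see lead-in):
            # ... residue, atom, x, y, z, and two further numeric values.
            x, y, z = (float(v) for v in cols[-5:-2])
            info["atoms"].append((cols[-7], cols[-6], (x, y, z)))
    return info

# A few records copied from the D1HBBA_ excerpt above.
sample = """\
ID   D1HBBA_
XX
EX   METHOD xray; RESO 1.90; NMOD 1; NCHA 1;
XX
IN   ID A; NR 141; NH 0; NW 0;
XX
CO   1 1 P 1 1 V VAL N 7.155 17.725 4.424 1.00 37.82
CO   1 1 P 1 1 V VAL CA 7.854 18.800 3.718 1.00 35.10
//
"""
info = parse_embl_like(sample)
print(info["id"], info["ex"]["METHOD"], info["chains"][0]["NR"])
```

Values are kept as strings except for the coordinates, so that fields such as RESO can be converted by the caller as needed.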
domainer generates a log file, an excerpt of which is shown below. If there is a problem in processing a domain, three lines containing the record '//', the domain identifier code and an error message respectively are written. The text 'WARN filename not found' is given in cases where a clean coordinate file could not be found. 'ERROR filename file read error' or 'ERROR filename file write error' will be reported when an error was encountered during a file read or write respectively. Various other error messages may also be given (in case of difficulty email Jon Ison, jison@hgmp.mrc.ac.uk).
    //
    DS002__
    WARN  Could not open for reading cpdb file s002.pxyz
    //
    DS003__
    WARN  Could not open for reading cpdb file s003.pxyz
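Because each problem is logged as three lines ('//', the domain identifier code, then the message), a log can be summarised with a few lines of code. The Python sketch below is illustrative only: the function name `parse_domainer_log` is hypothetical, and it assumes every entry follows the three-line layout described above.

```python
def parse_domainer_log(text):
    """Return (domain_id, message) pairs from three-line log entries."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = []
    i = 0
    while i < len(lines):
        # An entry is '//' followed by the domain code and the message.
        if lines[i] == "//" and i + 2 < len(lines):
            # Collapse runs of spaces inside the message.
            message = " ".join(lines[i + 2].split())
            entries.append((lines[i + 1], message))
            i += 3
        else:
            i += 1
    return entries

# The two entries from the log excerpt.
sample_log = """\
//
DS002__
WARN  Could not open for reading cpdb file s002.pxyz
//
DS003__
WARN  Could not open for reading cpdb file s003.pxyz
"""
entries = parse_domainer_log(sample_log)
for domain_id, message in entries:
    print(domain_id, "->", message)
```

The first word of each message (WARN or ERROR) can then be used to separate missing input files from read or write failures.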
EMBL-like format clean protein coordinate files
EMBL-like format clean domain coordinate files
PDB-format clean domain coordinate files
Program name | Description |
---|---|
cutgextract | Extract data from CUTG |
nrscope | Converts redundant EMBL-format SCOP file to non-redundant one |
printsextract | Extract data from PRINTS |
prosextract | Builds the PROSITE motif database for patmatmotifs to search |
rebaseextract | Extract data from REBASE |
scope | Convert raw scop classification file to embl-like format |
tfextract | Extract data from TRANSFAC |