EMBOSS: nrscope
The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the Protein Data Bank (PDB).
nrscope reads in the EMBL-like format SCOP classification file generated by the EMBOSS application scope, and writes a file of non-redundant domains in the same format. Domain sequences are extracted from the clean domain coordinate files generated by the EMBOSS application domainer.
The current version of nrscope removes redundancy at the level of the SCOP family, i.e. entries belonging to the same family will be non-redundant. All pairwise sequence alignments are calculated for each SCOP family in turn, using the EMBOSS implementation of the Needleman and Wunsch global alignment algorithm. If a pair of proteins shares more than a user-specified threshold percentage sequence identity, the shorter sequence is discarded. The user must specify gap insertion and extension penalties and a residue substitution matrix for use in the alignments.
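The family-level filtering described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not nrscope's actual code: the function names `global_identity` and `remove_redundancy` are hypothetical, the alignment uses plain match/mismatch scores and a single linear gap penalty instead of a substitution matrix with separate gap insertion and extension penalties, percent identity is assumed to be identical columns divided by alignment length, and the pair ordering and the handling of equal-length sequences are guesses.

```python
from itertools import combinations


def global_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity of a Needleman-Wunsch global alignment.

    Simplification (assumption): linear gap penalty and match/mismatch
    scores rather than a substitution matrix with affine gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting identical columns.
    i, j = n, m
    identical = columns = 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return 100.0 * identical / columns if columns else 0.0


def remove_redundancy(family, threshold=95.0, identity=global_identity):
    """Split one family (dict of id -> sequence) into retained and rejected ids.

    For every pair still retained, the shorter sequence is rejected when the
    identity exceeds the threshold. On equal lengths the second member of the
    pair is rejected (an arbitrary choice made for this sketch).
    """
    rejected = set()
    for (id_a, seq_a), (id_b, seq_b) in combinations(family.items(), 2):
        if id_a in rejected or id_b in rejected:
            continue
        if identity(seq_a, seq_b) > threshold:
            rejected.add(id_a if len(seq_a) < len(seq_b) else id_b)
    retained = [dom for dom in family if dom not in rejected]
    return retained, sorted(rejected)
```

Passing the identity function as a parameter keeps the filtering loop independent of the alignment method, so a matrix-based aligner with affine gap penalties could be substituted without changing `remove_redundancy`.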
% nrscope
Converts redundant EMBL-format SCOP file to non-redundant one
Name of scop file for input (embl-like format) [Escop.dat]: /data/scop/Escop.dat
Name of non-redundant scop file for output (embl-like format) [EscopNR.dat]: EscopNR.test
Location of clean domain coordinate files for input (embl-like format) [./]: /data/cpdbscop/
File extension of clean domain coordinate files [.pxyz]:
The % sequence identity redundancy threshold [95]: 95
Residue substitution file [EBLOSUM62]:
Gap insertion penalty [10]: 20
Gap extension penalty [0.5]: 1
Name of log file for the build [nrscope.log]: EscopNR.log
D3SDHA_ D3SDHB_ D3HBIA_ D3HBIB_ D4SDHA_ D4SDHB_ D4HBIA_ D4HBIB_ D5HBIA_ D5HBIB_
Mandatory qualifiers:
  [-scopin] infile Name of scop file for input (embl-like format)
  [-scopout] outfile Name of non-redundant scop file for output (embl-like format)
  [-dpdb] string Location of clean domain coordinate files for input (embl-like format)
  [-extn] string File extension of clean domain coordinate files
  [-thresh] float The % sequence identity redundancy threshold
  [-datafile] matrixf Residue substitution matrix
  [-gapopen] float The gap insertion penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences.
  [-gapextend] float The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring.
  [-errf] outfile Name of log file for the build

Optional qualifiers: (none)

Advanced qualifiers: (none)

General qualifiers:
  -help bool report command line options. More information on associated and general qualifiers can be found with -help -verbose
Mandatory qualifiers | Description | Allowed values | Default |
---|---|---|---|
[-scopin] (Parameter 1) | Name of scop file for input (embl-like format) | Input file | Escop.dat |
[-scopout] (Parameter 2) | Name of non-redundant scop file for output (embl-like format) | Output file | EscopNR.dat |
[-dpdb] (Parameter 3) | Location of clean domain coordinate files for input (embl-like format) | Any string is accepted | ./ |
[-extn] (Parameter 4) | File extension of clean domain coordinate files | Any string is accepted | .pxyz |
[-thresh] (Parameter 5) | The % sequence identity redundancy threshold | Any numeric value | 95.0 |
[-datafile] (Parameter 6) | Residue substitution matrix | Comparison matrix file in EMBOSS data path | EBLOSUM62 |
[-gapopen] (Parameter 7) | The gap insertion penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. | Floating point number from 1.0 to 100.0 | 10.0 for any sequence |
[-gapextend] (Parameter 8) | The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. | Floating point number from 0.0 to 10.0 | 0.5 for any sequence |
[-errf] (Parameter 9) | Name of log file for the build | Output file | nrscope.log |

Optional qualifiers | Description | Allowed values | Default |
---|---|---|---|
(none) | | | |

Advanced qualifiers | Description | Allowed values | Default |
---|---|---|---|
(none) | | | |
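For scripted, non-interactive runs, the qualifiers listed above can be assembled into a command line. The sketch below is only an illustration: the helper name `build_nrscope_command` is hypothetical, the defaults and the numeric ranges for the gap penalties are copied from the qualifier table, and actually executing the result assumes an `nrscope` executable is available on the PATH.

```python
def build_nrscope_command(scopin="Escop.dat", scopout="EscopNR.dat",
                          dpdb="./", extn=".pxyz", thresh=95.0,
                          datafile="EBLOSUM62", gapopen=10.0,
                          gapextend=0.5, errf="nrscope.log"):
    """Return an argument list for nrscope (hypothetical helper).

    The range checks mirror the allowed values in the qualifier table.
    """
    if not 1.0 <= gapopen <= 100.0:
        raise ValueError("gapopen must be between 1.0 and 100.0")
    if not 0.0 <= gapextend <= 10.0:
        raise ValueError("gapextend must be between 0.0 and 10.0")
    return [
        "nrscope",
        "-scopin", str(scopin),
        "-scopout", str(scopout),
        "-dpdb", str(dpdb),
        "-extn", str(extn),
        "-thresh", str(thresh),
        "-datafile", str(datafile),
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-errf", str(errf),
    ]


# Same values as the interactive session shown earlier.
command = build_nrscope_command(
    scopin="/data/scop/Escop.dat",
    scopout="EscopNR.test",
    dpdb="/data/cpdbscop/",
    gapopen=20,
    gapextend=1,
    errf="EscopNR.log",
)
print(" ".join(command))
```

The list form can be handed directly to `subprocess.run(command, check=True)`, which avoids shell quoting problems with paths.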
Records (4) to (8) describe the position of the domain in the SCOP hierarchy.
nrscope generates a log file, an excerpt of which is shown below. The first two lines describe the level in the SCOP hierarchy at which redundancy was removed (always 'FAMILIES' for the current implementation) and the value of the redundancy threshold. The file then contains a section for each SCOP family. Each section contains a line with the record '//' immediately followed by the name of the SCOP family, and two lines containing 'Retained' and 'Rejected' respectively. Domain identifier codes of domains that were discarded by nrscope are listed under 'Rejected' while those that appear in the output file are listed under 'Retained'. The text 'WARN filename not found' is given in cases where a clean domain coordinate file could not be found and 'WARN Empty family' where no files for an entire family could be found. 'ERROR filename file read error' will be given when an error was encountered during a file read.
FAMILIES are non-redundant
95% redundancy threshold
// Homeodomain
Retained
D2HDDA_ D1AKHA_ D1MNMC_
Rejected
D2HDDB_ D1ENH__ D3HDDA_
WARN d3hdda_.pxyz not found
// Di-haem cytohrome c peroxidase
WARN ds005__.pxyz not found
WARN Empty family
// Nuclear receptor coactivator Src-1
Retained
D2PRGC_
Rejected
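The log layout described above is simple enough to read back with a short script. The following is a hedged sketch rather than an official parser: the function name `parse_nrscope_log` is hypothetical, and the exact line layout of the log is an assumption, so the code accepts domain identifiers either on the same line as 'Retained'/'Rejected' or on the lines that follow.

```python
def parse_nrscope_log(text):
    """Parse an nrscope-style log into a list of per-family dicts.

    Each dict has the keys 'name', 'retained', 'rejected' and 'messages'
    (WARN/ERROR lines). Lines before the first '//' record are skipped.
    """
    families = []
    current = None   # family section being filled
    bucket = None    # 'retained' or 'rejected', set by the last keyword seen
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("//"):
            current = {"name": line[2:].strip(), "retained": [],
                       "rejected": [], "messages": []}
            families.append(current)
            bucket = None
        elif current is None:
            continue  # header lines (hierarchy level, threshold)
        elif line.startswith(("WARN", "ERROR")):
            current["messages"].append(line)
        else:
            for token in line.split():
                if token in ("Retained", "Rejected"):
                    bucket = token.lower()
                elif bucket is not None:
                    current[bucket].append(token)
    return families


SAMPLE_LOG = """\
FAMILIES are non-redundant
95% redundancy threshold
// Homeodomain
Retained
D2HDDA_ D1AKHA_ D1MNMC_
Rejected
D2HDDB_ D1ENH__ D3HDDA_
WARN d3hdda_.pxyz not found
// Nuclear receptor coactivator Src-1
Retained
D2PRGC_
Rejected
"""

for family in parse_nrscope_log(SAMPLE_LOG):
    print(family["name"], len(family["retained"]), len(family["rejected"]))
```

Collecting WARN and ERROR lines per family makes it easy to list the families whose coordinate files were missing before trusting the non-redundant output.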
Program name | Description |
---|---|
cutgextract | Extract data from CUTG |
domainer | Build domain coordinate files |
printsextract | Extract data from PRINTS |
prosextract | Builds the PROSITE motif database for patmatmotifs to search |
rebaseextract | Extract data from REBASE |
scope | Convert raw scop classification file to embl-like format |
tfextract | Extract data from TRANSFAC |