EMBOSS: sigscan
% sigscan
Mandatory qualifiers:
  [-sigin]     infile   Name of signature file for input
  [-database]  seqall   Name of sequence database to search
  [-targetf]   infile   Name of (optionally grouped) scop families file for input
  [-thresh]    integer  Minimum length (residues) of overlap required for two hits with the same code to be counted as the same hit.
  [-sub]       matrixf  Residue substitution matrix
  [-gapo]      float    The gap insertion penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAMAT matrix for nucleotide sequences.
  [-gape]      float    The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring.
  -nterm       list     Select number
  [-nhits]     integer  Number of hits to output
  [-hitsf]     outfile  Name of signature hits file for output
  [-alignf]    outfile  Name of signature alignments file for output

Optional qualifiers: (none)

Advanced qualifiers: (none)

General qualifiers:
  -help        bool     Report command line options. More information on associated and general qualifiers can be found with -help -verbose
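
As an illustration of how the qualifiers listed above fit together on one command line, the following Python sketch assembles an argument list for sigscan. The helper name `build_sigscan_command` is hypothetical, and the default values are the ones given in this documentation. The sketch only builds and prints the command; actually running it assumes an EMBOSS installation that provides sigscan and the named input files.

```python
import shlex


def build_sigscan_command(
    sigin="test.sig",
    database="./test.seq",
    targetf="test.fam",
    thresh=20,
    sub="./EBLOSUM62",
    gapo=10.0,
    gape=0.5,
    nterm=1,
    nhits=100,
    hitsf="test.hits",
    alignf="test.align",
):
    """Return an argv-style list for sigscan (hypothetical helper).

    Only qualifier names that appear in this documentation are used.
    """
    options = [
        ("-sigin", sigin),
        ("-database", database),
        ("-targetf", targetf),
        ("-thresh", thresh),
        ("-sub", sub),
        ("-gapo", gapo),
        ("-gape", gape),
        ("-nterm", nterm),
        ("-nhits", nhits),
        ("-hitsf", hitsf),
        ("-alignf", alignf),
    ]
    command = ["sigscan"]
    for name, value in options:
        command.extend([name, str(value)])
    return command


cmd = build_sigscan_command()
# Print a shell-quoted version that could be pasted into a terminal.
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` if sigscan is on the `PATH`.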
Mandatory qualifiers | Description | Allowed values | Default |
---|---|---|---|
[-sigin] (Parameter 1) | Name of signature file for input | Input file | test.sig |
[-database] (Parameter 2) | Name of sequence database to search | Readable sequence(s) | ./test.seq |
[-targetf] (Parameter 3) | Name of (optionally grouped) scop families file for input | Input file | test.fam |
[-thresh] (Parameter 4) | Minimum length (residues) of overlap required for two hits with the same code to be counted as the same hit. | Any integer value | 20 |
[-sub] (Parameter 5) | Residue substitution matrix | Comparison matrix file in EMBOSS data path | ./EBLOSUM62 |
[-gapo] (Parameter 6) | The gap insertion penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAMAT matrix for nucleotide sequences. | Floating point number from 1.0 to 100.0 | 10.0 for any sequence |
[-gape] (Parameter 7) | The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. | Floating point number from 0.0 to 10.0 | 0.5 for any sequence |
-nterm | Select number | | 1 |
[-nhits] (Parameter 8) | Number of hits to output | Any integer value | 100 |
[-hitsf] (Parameter 9) | Name of signature hits file for output | Output file | test.hits |
[-alignf] (Parameter 10) | Name of signature alignments file for output | Output file | test.align |
Optional qualifiers | (none) | | |
Advanced qualifiers | (none) | | |

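
The `-gapo`, `-gape` and `-thresh` descriptions can be made concrete with a small Python sketch. It reads the gap text literally (the open penalty is charged once and the extension penalty once per residue in the gap) and treats a hit as an inclusive residue range. Both readings are assumptions made for illustration, and the function names are hypothetical; sigscan's internal scoring may differ in detail.

```python
def gap_cost(length, gapo=10.0, gape=0.5):
    """Penalty for one gap of `length` residues (assumed reading of the text)."""
    if length <= 0:
        return 0.0
    return gapo + gape * length


def overlap_length(start_a, end_a, start_b, end_b):
    """Number of residues shared by two inclusive residue ranges."""
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


def same_hit(code_a, range_a, code_b, range_b, thresh=20):
    """True if two hits share a code and overlap by at least `thresh` residues."""
    if code_a != code_b:
        return False
    return overlap_length(*range_a, *range_b) >= thresh


# One long gap is cheaper than several short gaps of the same total length.
print(gap_cost(4), 4 * gap_cost(1))
# Residues 31..50 are shared: exactly 20, so this counts as the same hit.
print(same_hit("1RBPDFG", (1, 50), "1RBPDFG", (31, 80)))
```

With the default penalties, a single four-residue gap costs far less than four one-residue gaps, which is the behaviour the `-gape` description recommends for most searches.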
Example excerpt from a signature hits file:
DE   Results of signature search
XX
CL   All alpha proteins
XX
FO   Globin-like
XX
SF   Globin-like
XX
FA   Globins
XX
XX
HI   1    1RBPDFG   1   TRUE      TRUE      234   0.0001
HI   2    1GFT35J   3   TRUE      TRUE      234   0.0008
HI   3    1KJUFGH   1   TRUE      TRUE      224   0.0108
HI   4    1GYU15R   2   CLOSE     TRUE      220   0.1876
HI   5    1LKI89O   2   CLOSE     TRUE      203   0.6787
HI   6    1QRTY58   1   TRUE      TRUE      199   0.9978
HI   7    2IOM78G   1   FALSE     FALSE     198   1.0844
HI   8    1SZR234   1   CLOSE     TRUE      198   1.4343
HI   9    3PONI57   1   DISTANT   FALSE     197   2.8849
HI   10   1PHDJBS   3   CLOSE     TRUE      190   2.9872
HI   11   1HIOHDW   1   UNKNOWN   UNKNOWN   160   5.8676
HI   12   199976T   1   CLOSE     TRUE      140   8.8346
XX
//

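
A minimal Python sketch for reading the HI records of a hits file such as the excerpt above. The field names are assumptions inferred from the example, since the column meanings are not spelled out here: the third column is kept as an opaque string, and the last two numeric columns are simply called `score` and `value`. A decimal comma in the last column is tolerated.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    rank: int
    code: str
    column3: str      # meaning not stated in this documentation
    class_a: str      # first classification column
    class_b: str      # second classification column
    score: int
    value: float


def parse_hits(text: str) -> List[Hit]:
    """Collect HI records; every other record type (DE, CL, XX, //) is skipped."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8 or fields[0] != "HI":
            continue
        hits.append(
            Hit(
                rank=int(fields[1]),
                code=fields[2],
                column3=fields[3],
                class_a=fields[4],
                class_b=fields[5],
                score=int(fields[6]),
                value=float(fields[7].replace(",", ".")),
            )
        )
    return hits


sample = """\
DE   Results of signature search
XX
HI   1    1RBPDFG   1   TRUE      TRUE      234   0.0001
HI   7    2IOM78G   1   FALSE     FALSE     198   1.0844
HI   11   1HIOHDW   1   UNKNOWN   UNKNOWN   160   5,8676
XX
//
"""
hits = parse_hits(sample)
for hit in hits:
    print(hit.rank, hit.code, hit.class_a, hit.score, hit.value)
```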
(1) In the signature alignments file, the DE, CL, FO, SF, FA, XX and // records have the same meaning as in the hits file (above).
(2) Other lines contain either a fragment of protein sequence preceded by an accession number, or a fragment of an alignment of a signature to the protein sequence (signature positions are marked with a '*'). The two numbers on either side of the sequence are the begin and end residue numbers for that line.
Example excerpt from a signature alignment file:

DE   Results of signature search
XX
CL   Alpha and beta proteins (a/b)
XX
FO   alpha/beta-Hydrolases
XX
SF   alpha/beta-Hydrolases
XX
FA   Acetylcholinesterase-like
XX
OPSD_HUMAN   1 MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMF 45
SIGNATURE    - ---------*------------*---------------*------
OPSD_XENLA   1 MNGTEGPNFYVPMSNKTGVVRSPFDYPQYYLAEPWQYSALAAYMF 45
SIGNATURE    - --------*-------------*----------------*-----
XX
OPSD_HUMAN  46 LLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVLGG 90
SIGNATURE    - --------------*--------------*------------*--
OPSD_XENLA  46 LLILLGLPINFMTLFVTIQHKKLRTPLNYILLNLVFANHFMVLCG 90
SIGNATURE    - --------------*--------------*------------*--
XX
OPSD_HUMAN  91 FTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIER 135
SIGNATURE    - ---------*--*--------------------------**----
OPSD_XENLA  91 FTVTMYTSMHGYFIFGPTGCYIEGFFATLGGEVALWSLVVLAVER 135
SIGNATURE    - ---------*----*-------------------------**---
XX
//

Definition of classes of hit
The primary classification is an objective definition of the hit and has one of the following values:
TRAIN - the sequence was included in the original alignment from which the signature was generated.
PSIBLAST - A protein which was detected by PSI-BLAST (see psiblasts.c) to be a homologue of at least one of the proteins in the family from which the signature was derived. Such proteins are identified by the 'PSIBLAST' record in the scop families file.
OTHER - A true member of the family but not a homologue as detected by PSI-BLAST. Such proteins may have been found from the literature and manually added to the scop families file or may have been detected by the EMBOSS program swissparse (see swissparse.c). They are identified in the scop families file by the 'OTHER' record.
CROSS - A protein which is homologous to a protein of the same fold as, but a different family from, the proteins from which the signature was derived.
FALSE - A homologue of a protein with a different fold from that of the family of the signature.
UNKNOWN - The protein is not known to be CROSS, FALSE or a true hit (TRAIN, PSIBLAST or OTHER).
The secondary classification is provided for convenience and takes its value as follows:
Hits of TRAIN, PSIBLAST and OTHER classification are all listed as TRUE.
Hits of CROSS, FALSE or UNKNOWN objective classification are listed as CROSS, FALSE or UNKNOWN respectively.
The subjective column allows for hand-annotation of the hits files, so that proteins of UNKNOWN objective classification can be re-classified by a human expert as TRUE, FALSE or CROSS, or otherwise left as UNKNOWN, for the purpose of generating signature performance plots with the EMBOSS application sigplot.
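
The rule that derives the secondary classification from the primary one can be written as a short lookup. The Python sketch below follows the definitions above; the function name is hypothetical, and rejecting unrecognised values with an error is a design choice of the sketch, not documented behaviour.

```python
TRUE_CLASSES = {"TRAIN", "PSIBLAST", "OTHER"}
PASS_THROUGH_CLASSES = {"CROSS", "FALSE", "UNKNOWN"}


def secondary_classification(primary):
    """Map a primary (objective) classification to its secondary value."""
    if primary in TRUE_CLASSES:
        return "TRUE"
    if primary in PASS_THROUGH_CLASSES:
        return primary
    raise ValueError(f"unrecognised primary classification: {primary!r}")


for primary in ("TRAIN", "PSIBLAST", "OTHER", "CROSS", "FALSE", "UNKNOWN"):
    print(primary, "->", secondary_classification(primary))
```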
Important - In the case where a signature file is generated by hand, it is essential that the gap data given is listed in order of increasing gap size.
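
For hand-made signature files, the ordering requirement can be checked before running a search. The Python sketch below works on a plain list of gap sizes and assumes "increasing" means strictly increasing. How the gap sizes are read out of the signature file is left out, because the file layout is not described here, and the function name is hypothetical.

```python
def gaps_in_increasing_order(gap_sizes):
    """True if every gap size is larger than the one listed before it."""
    return all(a < b for a, b in zip(gap_sizes, gap_sizes[1:]))


print(gaps_in_increasing_order([0, 1, 3, 7]))
print(gaps_in_increasing_order([0, 3, 1, 7]))
```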
Program name | Description |
---|---|
contacts | Reads coordinate files and writes contact files |
dichet | Parse dictionary of heterogen groups |
psiblasts | Runs PSI-BLAST given scopalign alignments |
scopalign | Generate alignments for SCOP families |
siggen | Generates a sparse protein signature |