EMBOSS: sigscan


Program sigscan

Function

Scans a sparse protein signature against swissprot

Description

sigscan scans a signature, such as one generated by the EMBOSS application siggen, against a protein sequence database and generates files of scored hits and corresponding alignments. An (optionally grouped) scop families file can be provided, in which case a classification of hits is written to the signature hits output file. See the documentation for the EMBOSS application psiblasts for an explanation of the scop families file, and that of groups for information on how to group it.

Signatures

Signatures extend the concept of the motif as a tool for characterizing protein families. They consist of a set of n key residue positions (A1, A2 ... An), each preceded by a gap (G), thus G1A1G2A2...GnAn. Both a residue and a gap can be variable. A signature is matched to a protein sequence and scored using a dynamic programming algorithm which permits variability in gap distance and residue type. Generating a signature involves identifying residues associated with points of contact in interactions between secondary structure elements. A raw signature consists of a set of positions with potential key structural roles sampled from a sequence alignment constructed with reference to this contact data. Raw signatures are refined by sampling different gap-residue pairs until the specificity of a signature for the family cannot be further improved.
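The matching step described above can be illustrated with a small dynamic programming sketch. It is not the sigscan implementation: the SigPosition structure, the MISMATCH score and the gap penalty rule (no penalty for a gap length seen in the signature, otherwise an opening penalty plus an extension penalty per residue of difference from the nearest observed gap) are all assumptions made for illustration. The gap before the first signature position is treated as free.

```python
from dataclasses import dataclass

MISMATCH = -4.0  # hypothetical score for a residue not listed at a position


@dataclass
class SigPosition:
    residues: dict  # residue letter -> score when matched (illustrative)
    gaps: set       # gap lengths observed before this position (illustrative)


def gap_penalty(gap, observed, gapo=10.0, gape=0.5):
    """Assumed rule: observed gap lengths are free, others are penalized."""
    if not observed or gap in observed:
        return 0.0
    nearest = min(abs(gap - g) for g in observed)
    return gapo + gape * nearest


def score_signature(signature, sequence, gapo=10.0, gape=0.5):
    """Best score for placing every signature position on the sequence."""
    n, length = len(signature), len(sequence)
    neg = float("-inf")
    if n == 0 or length == 0:
        return neg
    # best[i][j]: best score with signature position i placed on sequence[j]
    best = [[neg] * length for _ in range(n)]
    for j, res in enumerate(sequence):
        # The gap before the first position is free (align anywhere).
        best[0][j] = signature[0].residues.get(res, MISMATCH)
    for i in range(1, n):
        pos = signature[i]
        for j in range(i, length):
            # Try every placement k of the previous position, k < j.
            prev = max(
                best[i - 1][k] - gap_penalty(j - k - 1, pos.gaps, gapo, gape)
                for k in range(i - 1, j)
            )
            best[i][j] = prev + pos.residues.get(sequence[j], MISMATCH)
    return max(best[n - 1])


sig = [SigPosition({"A": 5.0}, {0}), SigPosition({"C": 5.0}, {2})]
print(score_signature(sig, "AXXC"))
```

The table costs O(n * L^2) time for a signature of n positions and a sequence of length L, which is acceptable for a sketch because signatures are sparse.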

Usage

Here is a sample session with sigscan:

% sigscan
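When sigscan is driven from a script rather than interactively, the qualifiers documented in the next section can be passed on the command line. The helper below is a hypothetical sketch (the function name build_sigscan_command is not part of EMBOSS) that assembles an argument list from the documented qualifiers and their documented defaults. It only builds the list and does not run the program.

```python
def build_sigscan_command(sigin, database, targetf, hitsf, alignf,
                          thresh=20, sub="./EBLOSUM62", gapo=10.0,
                          gape=0.5, nterm=1, nhits=100):
    """Return an argv list for sigscan using the documented qualifiers."""
    options = [
        ("-sigin", sigin),
        ("-database", database),
        ("-targetf", targetf),
        ("-thresh", thresh),
        ("-sub", sub),
        ("-gapo", gapo),
        ("-gape", gape),
        ("-nterm", nterm),
        ("-nhits", nhits),
        ("-hitsf", hitsf),
        ("-alignf", alignf),
    ]
    argv = ["sigscan"]
    for name, value in options:
        argv.extend([name, str(value)])
    return argv


cmd = build_sigscan_command("test.sig", "./test.seq", "test.fam",
                            "test.hits", "test.align")
print(" ".join(cmd))
```

On a system where sigscan is installed, the resulting list could be handed to subprocess.run(cmd, check=True).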

Command line arguments

   Mandatory qualifiers:
  [-sigin]             infile     Name of signature file for input
  [-database]          seqall     Name of sequence database to search
  [-targetf]           infile     Name of (optionally grouped) scop families
                                  file for input
  [-thresh]            integer    Minimum length (residues) of overlap
                                  required for two hits with the same code to
                                  be counted as the same hit.
  [-sub]               matrixf    Residue substitution matrix
  [-gapo]              float      The gap insertion penalty is the score taken
                                  away when a gap is created. The best value
                                  depends on the choice of comparison matrix.
                                  The default value assumes you are using the
                                  EBLOSUM62 matrix for protein sequences, and
                                  the EDNAMAT matrix for nucleotide sequences.
  [-gape]              float      The gap extension penalty is added to the
                                  standard gap penalty for each base or
                                  residue in the gap. This is how long gaps
                                  are penalized. Usually you will expect a few
                                  long gaps rather than many short gaps, so
                                  the gap extension penalty should be lower
                                  than the gap penalty. An exception is where
                                  one or both sequences are single reads with
                                  possible sequencing errors in which case you
                                  would expect many single base gaps. You can
                                  get this result by setting the gap open
                                  penalty to zero (or very low) and using the
                                  gap extension penalty to control gap
                                  scoring.
   -nterm              list       Select number
  [-nhits]             integer    Number of hits to output
  [-hitsf]             outfile    Name of signature hits file for output
  [-alignf]            outfile    Name of signature alignments file for output

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers:

[-sigin] (Parameter 1)
    Description:    Name of signature file for input
    Allowed values: Input file
    Default:        test.sig

[-database] (Parameter 2)
    Description:    Name of sequence database to search
    Allowed values: Readable sequence(s)
    Default:        ./test.seq

[-targetf] (Parameter 3)
    Description:    Name of (optionally grouped) scop families file for input
    Allowed values: Input file
    Default:        test.fam

[-thresh] (Parameter 4)
    Description:    Minimum length (residues) of overlap required for two hits with the same code to be counted as the same hit.
    Allowed values: Any integer value
    Default:        20

[-sub] (Parameter 5)
    Description:    Residue substitution matrix
    Allowed values: Comparison matrix file in EMBOSS data path
    Default:        ./EBLOSUM62

[-gapo] (Parameter 6)
    Description:    The gap insertion penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAMAT matrix for nucleotide sequences.
    Allowed values: Floating point number from 1.0 to 100.0
    Default:        10.0 for any sequence

[-gape] (Parameter 7)
    Description:    The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring.
    Allowed values: Floating point number from 0.0 to 10.0
    Default:        0.5 for any sequence

-nterm
    Description:    Select number
    Allowed values: 1 (Align anywhere and allow only complete signature-sequence fit)
                    2 (Align anywhere and allow partial signature-sequence fit)
                    3 (Use empirical gaps only)
    Default:        1

[-nhits] (Parameter 8)
    Description:    Number of hits to output
    Allowed values: Any integer value
    Default:        100

[-hitsf] (Parameter 9)
    Description:    Name of signature hits file for output
    Allowed values: Output file
    Default:        test.hits

[-alignf] (Parameter 10)
    Description:    Name of signature alignments file for output
    Allowed values: Output file
    Default:        test.align

Optional qualifiers: (none)

Advanced qualifiers: (none)
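The -thresh qualifier merges hits that share a code and overlap by at least the given number of residues. How sigscan measures the overlap internally is not documented here, so the following is only a sketch under stated assumptions: the Hit structure is hypothetical, and start and end are taken to be inclusive residue numbers.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    code: str   # identifier of the hit sequence (hypothetical field)
    start: int  # first residue of the hit, inclusive (assumption)
    end: int    # last residue of the hit, inclusive (assumption)


def overlap_length(a, b):
    """Number of residues shared by two inclusive ranges (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def same_hit(a, b, thresh=20):
    """True if two hits share a code and overlap by at least thresh residues."""
    return a.code == b.code and overlap_length(a, b) >= thresh


print(same_hit(Hit("P1", 1, 50), Hit("P1", 31, 80)))
```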

Input file format

Excerpts from a signature hits file (Figure 1) are shown. The records used are as follows:

  1. DE - bibliographic information. The text 'Results of signature search' is always given.
  2. Four SCOP classification records are given: CL (class), FO (fold), SF (superfamily) and FA (family).
  3. HI - hit data. The data are as follows (column numbers are given in parentheses).
  4. XX - used for spacing.
  5. // - The file ends with a line containing '//' only.

Example excerpt from a signature hits file:


DE   Results of signature search
XX
CL   All alpha proteins
XX
FO   Globin-like
XX
SF   Globin-like
XX
FA   Globins
XX
XX
HI   1    1RBPDFG   1    TRUE     TRUE    234  0.0001 
HI   2    1GFT35J   3    TRUE     TRUE    234  0.0008 
HI   3    1KJUFGH   1    TRUE     TRUE    224  0.0108 
HI   4    1GYU15R   2    CLOSE    TRUE    220  0.1876 
HI   5    1LKI89O   2    CLOSE    TRUE    203  0.6787 
HI   6    1QRTY58   1    TRUE     TRUE    199  0.9978 
HI   7    2IOM78G   1    FALSE    FALSE   198  1.0844
HI   8    1SZR234   1    CLOSE    TRUE    198  1.4343 
HI   9    3PONI57   1    DISTANT  FALSE   197  2.8849 
HI  10    1PHDJBS   3    CLOSE    TRUE    190  2.9872
HI  11    1HIOHDW   1    UNKNOWN  UNKNOWN 160  5.8676
HI  12    199976T   1    CLOSE    TRUE    140  8.8346 
XX
//
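A hits file in this layout can be read record by record using the two-character code at the start of each line. The sketch below is illustrative only: parse_hits is a hypothetical helper, and because the meaning of the individual HI columns is not spelled out here, each HI record is kept as a plain list of whitespace-separated strings rather than being given field names.

```python
def parse_hits(text):
    """Split a signature hits file into SCOP records and raw HI rows."""
    classification = {}  # CL, FO, SF, FA -> text of the record
    hits = []            # one list of column strings per HI record
    for line in text.splitlines():
        line = line.strip()
        code, _, rest = line.partition(" ")
        rest = rest.strip()
        if code == "//":  # end-of-file marker
            break
        if code in ("CL", "FO", "SF", "FA"):
            classification[code] = rest
        elif code == "HI":
            hits.append(rest.split())
        # DE and XX records carry no data needed here and are skipped.
    return classification, hits


sample = """DE   Results of signature search
XX
CL   All alpha proteins
XX
FA   Globins
XX
HI   1    1RBPDFG   1    TRUE     TRUE    234  0.0001
XX
//
"""
scop, rows = parse_hits(sample)
print(scop["FA"], len(rows))
```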

Output file format

Excerpts from an alignment file are shown (Figure 2). The records used are as follows:

(1) The DE, CL, FO, SF, FA, XX and // records have the same meaning as in the hits file (above).

(2) Other lines contain either a fragment of protein sequence preceded by an accession number, or a fragment of an alignment of a signature to the protein sequence (signature positions are marked with a '*'). The two numbers on either side of the sequence are the begin and end residue numbers for that line.

Example excerpt from a signature alignment file


DE   Results of signature search
XX
CL   Alpha and beta proteins (a/b)
XX
FO   alpha/beta-Hydrolases
XX
SF   alpha/beta-Hydrolases
XX
FA   Acetylcholinesterase-like
XX
OPSD_HUMAN      1        MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMF 45    
SIGNATURE       -        ---------*------------*---------------*------   
OPSD_XENLA      1        MNGTEGPNFYVPMSNKTGVVRSPFDYPQYYLAEPWQYSALAAYMF 45    
SIGNATURE       -        --------*-------------*----------------*-----   
XX
OPSD_HUMAN      46       LLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVLGG 90
SIGNATURE       -        --------------*--------------*------------*--       
OPSD_XENLA      46       LLILLGLPINFMTLFVTIQHKKLRTPLNYILLNLVFANHFMVLCG 90    
SIGNATURE       -        --------------*--------------*------------*--       
XX
OPSD_HUMAN      91       FTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIER 135   
SIGNATURE       -        ---------*--*--------------------------**----       
OPSD_XENLA      91       FTVTMYTSMHGYFIFGPTGCYIEGFFATLGGEVALWSLVVLAVER 135   
SIGNATURE       -        ---------*----*-------------------------**---       
XX
//
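The residues that a signature aligns to can be recovered from an alignment file by pairing each sequence line with the SIGNATURE line that follows it. This is a minimal sketch, assuming the whitespace-separated layout of the excerpt above (accession, begin number, sequence fragment, end number); signature_positions is a hypothetical helper, not part of EMBOSS.

```python
def signature_positions(text):
    """Map each accession to a list of (residue number, residue) pairs
    marked with '*' on the SIGNATURE line that follows the sequence line."""
    matched = {}
    current = None  # (accession, begin number, sequence fragment)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[1].isdigit() and fields[3].isdigit():
            current = (fields[0], int(fields[1]), fields[2])
        elif len(fields) == 3 and fields[0] == "SIGNATURE" and current:
            acc, begin, seq = current
            for offset, mark in enumerate(fields[2]):
                if mark == "*" and offset < len(seq):
                    matched.setdefault(acc, []).append(
                        (begin + offset, seq[offset])
                    )
            current = None
    return matched


example = """SEQ_A      1        ACDEFGHIKL 10
SIGNATURE  -        --*----*--
"""
print(signature_positions(example))
```

Classification records such as CL or FA never match the four-field numeric pattern, so they are ignored without special handling.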

Definition of classes of hit

The primary classification is an objective definition of the hit and has one of the following values:

TRAIN - the sequence was included in the original alignment from which the signature was generated.

PSIBLAST - A protein which was detected by psiblast (see psiblasts.c) to be a homologue to at least one of the proteins in the family from which the signature was derived. Such proteins are identified by the 'PSIBLAST' record in the scop families file.

OTHER - A true member of the family but not a homologue as detected by psi-blast. Such proteins may have been found from the literature and manually added to the scop families file or may have been detected by the EMBOSS program swissparse (see swissparse.c). They are identified in the SCOP families file by the 'OTHER' record.

CROSS - A protein which is homologous to a protein of the same fold as, but a different family from, the proteins from which the signature was derived.

FALSE - A homologue to a protein with a different fold to the family of the signature.

UNKNOWN - The protein is not known to be CROSS, FALSE or a true hit (TRAIN, PSIBLAST or OTHER).

The secondary classification is provided for convenience and has a value as follows:

Hits of TRAIN, PSIBLAST and OTHER classification are all listed as TRUE.

Hits of CROSS, FALSE or UNKNOWN objective classification are listed as CROSS, FALSE or UNKNOWN respectively.
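The mapping from primary to secondary classification stated above is small enough to write out directly. The function name secondary_class is hypothetical and is not taken from the sigscan source.

```python
TRUE_CLASSES = {"TRAIN", "PSIBLAST", "OTHER"}
PASS_THROUGH = {"CROSS", "FALSE", "UNKNOWN"}


def secondary_class(primary):
    """Derive the secondary classification from the primary one."""
    if primary in TRUE_CLASSES:
        return "TRUE"
    if primary in PASS_THROUGH:
        return primary
    raise ValueError(f"unrecognised primary classification: {primary!r}")


for name in ("TRAIN", "PSIBLAST", "OTHER", "CROSS", "FALSE", "UNKNOWN"):
    print(name, "->", secondary_class(name))
```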

The subjective column allows for hand-annotation of the hits files, so that proteins of UNKNOWN objective classification can be re-classified by a human expert as TRUE, FALSE or CROSS, or otherwise left as UNKNOWN, for the purpose of generating signature performance plots with the EMBOSS application sigplot.

Data files

None.

Notes

Important - sigscan presumes that SCOP family names are unique. If this were not the case, changes to ajXyzClassifyHits would have to be made.

Important - In the case where a signature file is generated by hand, it is essential that the gap data given is listed in order of increasing gap size.

References

Ison JC, Blades MJ, Bleasby AJ, Daniel SC, Parish JH "Key residues approach to the definition of protein families and analysis of sparse family signatures" (2000) PROTEINS: Structure, Function and Genetics 40:330-341

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name   Description
contacts       Reads coordinate files and writes contact files
dichet         Parse dictionary of heterogen groups
psiblasts      Runs PSI-BLAST given scopalign alignments
scopalign      Generate alignments for SCOP families
siggen         Generates a sparse protein signature

Author(s)

This application was written by Jon Ison (jison@hgmp.mrc.ac.uk)

History

Written (July 2001) - Jon Ison

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments