EMBOSS: fuzzpro


Program fuzzpro

Function

Protein pattern search

Description

fuzzpro uses PROSITE-style patterns to search protein sequences.

Patterns are specifications of a (typically short) length of sequence to be found. They can specify a search for an exact sequence or they can allow various ambiguities, matches to variable lengths of sequence and repeated subsections of the sequence.

fuzzpro intelligently selects the optimum searching algorithm to use, depending on the complexity of the search pattern specified.

Usage

Here is a sample session with fuzzpro.

% fuzzpro
Input sequence: sw:*
Search pattern: [FY]-[LIV]-G-[DE]-E-A-Q-x-[RKQ](2)-G
Number of mismatches [0]: 
Output file [5h1d_fugru.fuzzpro]:

Command line arguments

   Mandatory qualifiers:
  [-sequence]          seqall     Sequence database USA
   -pattern            string     Search pattern
   -mismatch           integer    Number of mismatches
  [-outf]              outfile    Output file name

   Optional qualifiers: (none)
   Advanced qualifiers:
   -mmshow             bool       Show mismatches
   -accshow            bool       Show accession numbers
   -usashow            bool       Showing the USA (Uniform Sequence Address)
                                  of the matching sequences will turn your
                                  output file into a 'list' file that can then
                                  be read in by many other EMBOSS programs by
                                  specifying it with a '@' in front of the
                                  filename.
   -descshow           bool       Show descriptions

   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers
   Qualifier       Description                          Allowed values          Default
   [-sequence]     Sequence database USA                Readable sequence(s)    Required
   (Parameter 1)
   -pattern        Search pattern                       Any string is accepted  An empty string is accepted
   -mismatch       Number of mismatches                 Integer 0 or more       0
   [-outf]         Output file name                     Output file             <sequence>.fuzzpro
   (Parameter 2)

Optional qualifiers
   (none)

Advanced qualifiers
   Qualifier       Description                          Allowed values          Default
   -mmshow         Show mismatches                      Yes/No                  No
   -accshow        Show accession numbers               Yes/No                  No
   -usashow        Showing the USA (Uniform Sequence    Yes/No                  No
                   Address) of the matching sequences
                   will turn your output file into a
                   'list' file that can then be read
                   in by many other EMBOSS programs by
                   specifying it with a '@' in front
                   of the filename.
   -descshow       Show descriptions                    Yes/No                  No
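
As an illustration of how the qualifiers above fit together on a command line, the following Python sketch assembles an argument list from them. It only builds the list and does not run anything. The function name build_fuzzpro_command is hypothetical, and passing the result to subprocess.run would assume that fuzzpro is installed and on the PATH.

```python
def build_fuzzpro_command(sequence, pattern, outfile, mismatch=0,
                          mmshow=False, accshow=False,
                          usashow=False, descshow=False):
    """Return an argument list for fuzzpro built from the documented qualifiers."""
    command = [
        "fuzzpro",
        "-sequence", sequence,
        "-pattern", pattern,
        "-mismatch", str(mismatch),
        "-outf", outfile,
    ]
    # Boolean qualifiers are switched on by naming them, as in 'fuzzpro -mmshow'.
    for flag, enabled in (("-mmshow", mmshow), ("-accshow", accshow),
                          ("-usashow", usashow), ("-descshow", descshow)):
        if enabled:
            command.append(flag)
    return command


# Values taken from the '-mmshow' example session in this document.
command = build_fuzzpro_command("sw:100k_rat", "RARLX(3)R", "stdout",
                                mismatch=1, mmshow=True)
print(command)
```

Keeping the arguments as a list rather than a single string avoids having to quote the brackets and parentheses of the pattern for a shell.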

Input file format

Patterns for fuzzpro are based on the format of pattern used in the PROSITE database, with the difference that the terminating dot '.' and the hyphens, '-', between the characters are optional.

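The elements of the PROSITE pattern syntax that appear in this document are:

   x or X     any residue
   [DE]       any one of the residues listed between the square brackets
   {P}        any residue other than those listed between the curly brackets
   (2)        the preceding element repeated exactly two times
   (2,4)      the preceding element repeated two to four times
   -          separator between elements (optional in fuzzpro)
   .          terminator at the end of the pattern (optional in fuzzpro)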

For example, in SWISSPROT entry 100K_RAT you can look for the pattern:

[DE](2)HS{P}X(2)PX(2,4)C

This means: two residues, each either Asp or Glu (in any combination), followed by His, Ser, any residue other than Pro, then any two residues, followed by Pro, then two to four of any residue, followed by Cys.

The search is case-independent, so 'AAA' matches 'aaa'.
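
To make the pattern elements concrete, the following Python sketch translates a PROSITE-style pattern into a regular expression and searches with it case-independently. It is an illustration only and is not how fuzzpro is implemented. The function names prosite_to_regex and find_matches are hypothetical, mismatches are not supported, overlapping matches are not reported, and only the elements that appear in this document are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles residues, x (any residue), [..] (allowed set), {..} (excluded
    set), (n) and (n,m) repeat counts, optional '-' separators and an
    optional terminating '.'.
    """
    pattern = pattern.strip().rstrip(".").replace("-", "")
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":                       # set of allowed residues
            end = pattern.index("]", i)
            out.append("[" + pattern[i + 1:end].upper() + "]")
            i = end + 1
        elif ch == "{":                     # set of excluded residues
            end = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:end].upper() + "]")
            i = end + 1
        elif ch == "(":                     # repeat count for the previous element
            end = pattern.index(")", i)
            out.append("{" + pattern[i + 1:end] + "}")
            i = end + 1
        elif ch in "xX":                    # any residue
            out.append(".")
            i += 1
        elif ch.isalpha():                  # a single literal residue
            out.append(ch.upper())
            i += 1
        else:
            raise ValueError(f"unsupported character {ch!r} in pattern")
    return "".join(out)


def find_matches(pattern, sequence):
    """Return (1-based start position, matched text) for each match."""
    regex = re.compile(prosite_to_regex(pattern), re.IGNORECASE)
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


print(prosite_to_regex("[DE](2)HS{P}X(2)PX(2,4)C"))  # [DE]{2}HS[^P].{2}P.{2,4}C
print(find_matches("AAA", "ccaaacc"))                # [(3, 'aaa')]
```

The second call shows the case-independence described above: the upper-case pattern 'AAA' matches the lower-case residues 'aaa'.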

Output file format

Here is the output from the example search:


     ACT1_FUGRU    53 YVGDEAQSKRG
     ACT2_FUGRU    53 YVGDEAQSKRG
     ACT3_FUGRU    53 YVGDEAQSKRG
     ACTC_FUGRU    55 YVGDEAQSKRG
     ACTS_FUGRU    55 YVGDEAQSKRG
     ACTT_FUGRU    55 YVGDEAQSKRG

It is composed of three columns of data: the name of the sequence, the position at which the match starts, and the matched subsequence.
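
For scripts that post-process this default report, a minimal Python sketch of a reader is given below. It assumes that the three whitespace-separated columns are the sequence name, the match position and the matched residues, and it ignores any line that does not have that shape. The names Hit and parse_hits are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str       # sequence name, e.g. ACT1_FUGRU
    position: int   # position of the match in the sequence
    match: str      # the matched residues


def parse_hits(text):
    """Parse default three-column fuzzpro report lines into Hit records."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Keep only lines of the form: name, integer position, residues.
        if len(fields) == 3 and fields[1].isdigit():
            hits.append(Hit(fields[0], int(fields[1]), fields[2]))
    return hits


report = """
     ACT1_FUGRU    53 YVGDEAQSKRG
     ACTC_FUGRU    55 YVGDEAQSKRG
"""
hits = parse_hits(report)
for hit in hits:
    print(hit.name, hit.position, hit.match)
```

Reports written with '-mmshow' or '-usa' have extra columns, so this reader would skip their lines rather than misread them.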

If the option '-mmshow' is used, then an extra column of data is output indicating how many mismatches there are:

% fuzzpro -mmshow
Protein pattern search
Input sequence(s): sw:100k_rat
Search pattern: RARLX(3)R
Number of mismatches [0]: 1
Output file [100k_rat.fuzzpro]: stdout

       100K_RAT   613     1 EARLNCFRN

If the option '-descshow' is used (abbreviated to '-desc' in the session below), the description of the sequence is displayed on a separate line before each line of match details. For example:

% fuzzpro 'sw:*_HUMAN' -desc
Protein pattern search
Search pattern: [FY]-[LIV]-G-[DE]-E-A-Q-x-[RKQ](2)-G
Number of mismatches [0]: 
Output file [143b_human.fuzzpro]: stdout

ACTIN, AORTIC SMOOTH MUSCLE (ALPHA-ACTIN 2).
     ACTA_HUMAN    55 YVGDEAQSKRG
ACTIN, CYTOPLASMIC 1 (BETA-ACTIN).
     ACTB_HUMAN    53 YVGDEAQSKRG
ACTIN, ALPHA CARDIAC.
     ACTC_HUMAN    55 YVGDEAQSKRG
ACTIN, CYTOPLASMIC 2 (GAMMA-ACTIN).
     ACTG_HUMAN    53 YVGDEAQSKRG
ACTIN, GAMMA-ENTERIC SMOOTH MUSCLE (ALPHA-ACTIN 3).
     ACTH_HUMAN    54 YVGDEAQSKRG
ACTIN, ALPHA SKELETAL MUSCLE (ALPHA-ACTIN 1).
     ACTS_HUMAN    55 YVGDEAQSKRG

If the option '-accshow' is also used (abbreviated to '-acc' in the session below), the accession number of the sequence is displayed at the start of the description line that precedes each line of match details. For example:

% fuzzpro 'sw:*_HUMAN' -desc -acc
Protein pattern search
Search pattern: [FY]-[LIV]-G-[DE]-E-A-Q-x-[RKQ](2)-G
Number of mismatches [0]: 
Output file [143b_human.fuzzpro]: stdout

P03996 ACTIN, AORTIC SMOOTH MUSCLE (ALPHA-ACTIN 2).
     ACTA_HUMAN    55 YVGDEAQSKRG
P02570 ACTIN, CYTOPLASMIC 1 (BETA-ACTIN).
     ACTB_HUMAN    53 YVGDEAQSKRG
P04270 ACTIN, ALPHA CARDIAC.
     ACTC_HUMAN    55 YVGDEAQSKRG
P02571 ACTIN, CYTOPLASMIC 2 (GAMMA-ACTIN).
     ACTG_HUMAN    53 YVGDEAQSKRG
P12718 ACTIN, GAMMA-ENTERIC SMOOTH MUSCLE (ALPHA-ACTIN 3).
     ACTH_HUMAN    54 YVGDEAQSKRG
P02568 ACTIN, ALPHA SKELETAL MUSCLE (ALPHA-ACTIN 1).
     ACTS_HUMAN    55 YVGDEAQSKRG

If the option '-usashow' is used (abbreviated to '-usa' in the session below), the Uniform Sequence Address (USA) is output at the start of each line of match details. For example:

% fuzzpro 'sw:*_HUMAN' -usa
Protein pattern search
Search pattern: [FY]-[LIV]-G-[DE]-E-A-Q-x-[RKQ](2)-G
Number of mismatches [0]: 
Output file [143b_human.fuzzpro]: stdout
sw-id:ACTA_HUMAN             ACTA_HUMAN    55 YVGDEAQSKRG
sw-id:ACTB_HUMAN             ACTB_HUMAN    53 YVGDEAQSKRG
sw-id:ACTC_HUMAN             ACTC_HUMAN    55 YVGDEAQSKRG
sw-id:ACTG_HUMAN             ACTG_HUMAN    53 YVGDEAQSKRG
sw-id:ACTH_HUMAN             ACTH_HUMAN    54 YVGDEAQSKRG
sw-id:ACTS_HUMAN             ACTS_HUMAN    55 YVGDEAQSKRG

This is useful because it turns the output into a 'list' file of sequence names that can then be read in by other EMBOSS programs when a '@' is put in front of the filename.
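
Should a plain list with one USA per line be wanted instead, for example to deduplicate the entries, the first column can be pulled out of the report. The Python sketch below assumes the four-column layout shown in the '-usa' example above. The function name usa_list is hypothetical.

```python
def usa_list(report_text):
    """Collect the first column (the USA) from '-usa' style report lines."""
    usas = []
    for line in report_text.splitlines():
        fields = line.split()
        # A match line has four fields: USA, name, position, matched residues.
        if len(fields) == 4 and fields[2].isdigit():
            if fields[0] not in usas:       # keep each USA only once
                usas.append(fields[0])
    return usas


report = """
sw-id:ACTA_HUMAN             ACTA_HUMAN    55 YVGDEAQSKRG
sw-id:ACTB_HUMAN             ACTB_HUMAN    53 YVGDEAQSKRG
"""
usas = usa_list(report)
print("\n".join(usas))
```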

Data files

None.

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with a status of 0.

Known bugs

None.

See also

Program name     Description
antigenic        Finds antigenic sites in proteins
digest           Protein proteolytic enzyme or reagent cleavage digest
fuzztran         Protein pattern search after translation
helixturnhelix   Report nucleic acid binding motifs
oddcomp          Finds protein sequence regions with a biased composition
patmatdb         Search a protein sequence with a motif
patmatmotifs     Search a PROSITE motif database with a protein sequence
pepcoil          Predicts coiled coil regions
preg             Regular expression search of a protein sequence
pscan            Scans proteins using PRINTS
sigcleave        Reports protein signal cleavage sites

Other EMBOSS programs allow you to search for regular expression patterns, but they may be harder to use for anyone who has never used regular expressions before.

Author(s)

This application was written by Alan Bleasby (ableasby@hgmp.mrc.ac.uk)

History

Written (2000) - Alan Bleasby
'-usa' added (13 March 2001) - Gary Williams

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments