EMBOSS: fuzztran


Program fuzztran

Function

Protein pattern search after translation

Description

fuzztran uses PROSITE-style protein patterns to search nucleic acid sequences that have been translated in the specified frame(s).

Patterns are specifications of a (typically short) length of sequence to be found. They can specify a search for an exact sequence or they can allow various ambiguities, matches to variable lengths of sequence and repeated subsections of the sequence.

fuzztran intelligently selects the optimum searching algorithm to use, depending on the complexity of the search pattern specified.
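As a rough illustration of the translate-then-search idea (not fuzztran's actual algorithm), the sketch below translates a nucleotide sequence in the three forward frames using the standard code and reports exact hits of a plain peptide string. The function names and the toy sequence are invented for this example. Reporting each hit as the 1-based nucleotide position of its first codon is an assumption about how positions should be read.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order per position.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}

def translate(dna, frame):
    """Translate dna in forward frame 1, 2 or 3; unknown codons become 'X'."""
    dna = dna.upper().replace("U", "T")
    start = frame - 1
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def search_forward_frames(dna, peptide):
    """Yield (frame, nucleotide_position, match) for exact peptide hits."""
    peptide = peptide.upper()
    for frame in (1, 2, 3):
        protein = translate(dna, frame)
        pos = protein.find(peptide)
        while pos != -1:
            # Assumed convention: 1-based position of the first codon's first base.
            yield frame, frame + 3 * pos, protein[pos:pos + len(peptide)]
            pos = protein.find(peptide, pos + 1)

# Hypothetical toy sequence; frame 1 translates to MRA*RA.
hits = list(search_forward_frames("ATGCGTGCATAACGCGCC", "RA"))
```

Here `hits` holds one tuple per match. Reverse frames and mismatches are left out to keep the sketch short.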

Usage

Here is a sample session with fuzztran, using all options.

% fuzztran -opt
Protein pattern search after translation
Input sequence(s): embl:rnops
Translation frames
         1 : 1
         2 : 2
         3 : 3
         F : Forward three frames
        -1 : -1
        -2 : -2
        -3 : -3
         R : Reverse three frames
         6 : All six frames
Frame(s) to translate [1]: f
Genetic codes
         0 : Standard
         1 : Standard (with alternative initiation codons)
         2 : Vertebrate Mitochondrial
         3 : Yeast Mitochondrial
         4 : Mold/Protozoan/Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
         5 : Invertebrate Mitochondrial
         6 : Ciliate Macronuclear and dasycladacean
         9 : Echinoderm Mitochondrail
        10 : Euplotid Nuclear
        11 : Bacterial
        12 : Alternative Yeast Nuclear
        13 : Ascidian Mitochondrial
        14 : Flatworm Mitochondrial
        15 : Blepharisma Macronuclear
Genetic code to use [0]: 
Search pattern: RA
Number of mismatches [0]:  
Output file [rnops.fuzztran]:

Command line arguments

   Mandatory qualifiers:
  [-sequence]          seqall     Sequence database USA
   -pattern            string     Search pattern
   -mismatch           integer    Number of mismatches
  [-outf]              outfile    Output file name

   Optional qualifiers:
   -frame              list       Frame(s) to translate
   -table              list       Genetic code to use

   Advanced qualifiers:
   -mmshow             bool       Show mismatches
   -accshow            bool       Show accession numbers
   -descshow           bool       Show descriptions


Qualifier       Description             Default                       Allowed values

Mandatory qualifiers:
[-sequence]     Sequence database USA   Required                      Readable sequence(s)
(Parameter 1)
-pattern        Search pattern          An empty string is accepted   Any string is accepted
-mismatch       Number of mismatches    0                             Integer 0 or more
[-outf]         Output file name        <sequence>.fuzztran           Output file
(Parameter 2)

Optional qualifiers:
-frame          Frame(s) to translate   1                             1 (1)
                                                                      2 (2)
                                                                      3 (3)
                                                                      F (Forward three frames)
                                                                      -1 (-1)
                                                                      -2 (-2)
                                                                      -3 (-3)
                                                                      R (Reverse three frames)
                                                                      6 (All six frames)
-table          Genetic code to use     0                             0 (Standard)
                                                                      1 (Standard (with alternative initiation codons))
                                                                      2 (Vertebrate Mitochondrial)
                                                                      3 (Yeast Mitochondrial)
                                                                      4 (Mold/Protozoan/Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
                                                                      5 (Invertebrate Mitochondrial)
                                                                      6 (Ciliate Macronuclear and dasycladacean)
                                                                      9 (Echinoderm Mitochondrial)
                                                                      10 (Euplotid Nuclear)
                                                                      11 (Bacterial)
                                                                      12 (Alternative Yeast Nuclear)
                                                                      13 (Ascidian Mitochondrial)
                                                                      14 (Flatworm Mitochondrial)
                                                                      15 (Blepharisma Macronuclear)

Advanced qualifiers:
-mmshow         Show mismatches         No                            Yes/No
-accshow        Show accession numbers  No                            Yes/No
-descshow       Show descriptions       No                            Yes/No
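The qualifiers can also be given on the command line instead of being answered at the prompts. The following sketch assembles such a command as an argument list suitable for Python's subprocess module. The helper name build_fuzztran_command is hypothetical, only qualifiers documented in this section are used, and it assumes that fuzztran is on the PATH and that boolean qualifiers such as -mmshow are passed as bare flags.

```python
import shlex

def build_fuzztran_command(sequence, pattern, outfile,
                           mismatch=0, frame="1", table="0", mmshow=False):
    """Return an argv list for fuzztran built from the documented qualifiers."""
    argv = [
        "fuzztran",
        "-sequence", sequence,
        "-pattern", pattern,
        "-mismatch", str(mismatch),
        "-frame", str(frame),
        "-table", str(table),
        "-outf", outfile,
    ]
    if mmshow:
        argv.append("-mmshow")  # assumed: boolean qualifiers are bare flags
    return argv

command = build_fuzztran_command("embl:rnops", "TWLWLT", "stdout",
                                 mismatch=2, frame="6", mmshow=True)
print(shlex.join(command))  # show the command as it would be typed in a shell
```

The list can then be handed to subprocess.run(command, check=True). Passing a list rather than a single string avoids shell quoting problems with patterns that contain brackets or braces.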

Input file format

Patterns for fuzztran are based on the format of pattern used in the PROSITE database, with the difference that the terminating dot '.' and the hyphens, '-', between the characters are optional.

The PROSITE pattern definition from the PROSITE documentation follows.

For example, you can look for the pattern:


[DE](2)HS{P}X(2)PX(2,4)C

This means: two Asps or Glus in any order, followed by His, Ser, any residue other than Pro, then two of any residue, followed by Pro, followed by two to four of any residue, followed by Cys.
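To make the notation concrete, the sketch below converts the pattern elements used in the example above (sets in square brackets, exclusions in braces, X for any residue, repeat counts in parentheses, optional hyphens and terminating dot) into a Python regular expression. It is an illustration only: the function name prosite_to_regex is invented here, it does not reflect how fuzztran searches internally, and it does not handle mismatches or other PROSITE syntax.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a compiled, case-insensitive regex."""
    # Hyphens between elements and the terminating dot are optional, so drop them.
    text = pattern.strip().rstrip(".").replace("-", "")
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "[":                       # [DE]: any one of the listed residues
            end = text.index("]", i)
            out.append("[" + text[i + 1:end] + "]")
            i = end + 1
        elif ch == "{":                     # {P}: any residue except those listed
            end = text.index("}", i)
            out.append("[^" + text[i + 1:end] + "]")
            i = end + 1
        elif ch == "(":                     # (2) or (2,4): repeat count or range
            end = text.index(")", i)
            out.append("{" + text[i + 1:end] + "}")
            i = end + 1
        elif ch in "xX":                    # X: any residue
            out.append(".")
            i += 1
        else:                               # a literal residue
            out.append(re.escape(ch))
            i += 1
    return re.compile("".join(out), re.IGNORECASE)

regex = prosite_to_regex("[DE](2)HS{P}X(2)PX(2,4)C")
match = regex.search("mkdehsaggpllc")  # hypothetical lower-case test sequence
```

The example pattern becomes `[DE]{2}HS[^P].{2}P.{2,4}C`, and the case-insensitive flag mirrors the case-independent search described in this section.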

The search is case-independent, so 'AAA' matches 'aaa'.

Output file format

Here is the output from the example search:

          RNOPS  1       97 RA
          RNOPS  1      133 RA
          RNOPS  1      421 RA
          RNOPS  1      625 RA
          RNOPS  1      835 RA
          RNOPS  1      919 RA
          RNOPS  2      227 RA
          RNOPS  2      752 RA
          RNOPS  3       72 RA

It is composed of four columns of data.

If the option '-mmshow' is used, an extra column is inserted as the fourth of five columns, giving the number of mismatches in each match:


% fuzztran -mmshow -frame 6
Protein pattern search after translation
Input sequence(s): embl:rnops
Search pattern: TWLWLT
Number of mismatches [0]: 2
Output file [rnops.fuzztran]: stdout

          RNOPS  1      316     0 TWLWLT
          RNOPS  1     1138     2 QRLWLT
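Output in this layout is straightforward to read back into a script. The sketch below splits each line on whitespace and accepts both the four-column and the five-column ('-mmshow') forms. The column meanings (sequence name, frame, position, optional mismatch count, matched residues) are inferred from the examples above and are not stated explicitly in this documentation; the names Hit and parse_hit are invented for the example.

```python
from typing import NamedTuple, Optional

class Hit(NamedTuple):
    name: str                  # sequence name, e.g. RNOPS
    frame: str                 # assumed to be the translation frame; kept as text
    position: int              # assumed to be the start position of the match
    mismatches: Optional[int]  # only present when -mmshow was used
    match: str                 # the matched residues

def parse_hit(line):
    """Parse one fuzztran output line in either the 4- or 5-column layout."""
    fields = line.split()
    if len(fields) == 4:
        name, frame, position, match = fields
        return Hit(name, frame, int(position), None, match)
    if len(fields) == 5:
        name, frame, position, mismatches, match = fields
        return Hit(name, frame, int(position), int(mismatches), match)
    raise ValueError(f"unexpected number of columns: {line!r}")

report = """\
          RNOPS  1      316     0 TWLWLT
          RNOPS  1     1138     2 QRLWLT
"""
hits = [parse_hit(line) for line in report.splitlines() if line.strip()]
```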

Data files

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory or, again, in a subdirectory called ".embossdata".

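The directories are searched in the following order:

. (your current directory)
.embossdata (under your current directory)
~/ (your home directory)
~/.embossdata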

The Genetic Code data files are based on the NCBI genetic code tables. Their names and descriptions are:

EGC.0
Standard (Differs from EGC.1 in that it only has initiation site 'AUG')
EGC.1
Standard
EGC.2
Vertebrate Mitochondrial
EGC.3
Yeast Mitochondrial
EGC.4
Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
EGC.5
Invertebrate Mitochondrial
EGC.6
Ciliate Macronuclear and Dasycladacean
EGC.9
Echinoderm Mitochondrial
EGC.10
Euplotid Nuclear
EGC.11
Bacterial
EGC.12
Alternative Yeast Nuclear
EGC.13
Ascidian Mitochondrial
EGC.14
Flatworm Mitochondrial
EGC.15
Blepharisma Macronuclear

The format of these files is very simple.

Each file begins with several lines of optional comments, each starting with a '#' character.

These are followed by the line: 'Genetic Code [n]', where 'n' is the number of the genetic code file.

This is followed by the description of the code and then by five lines giving the IUPAC one-letter code of the translated amino acid, the start codons (indicated by an 'M') and the three bases of the codon, lined up one on top of the other.

For example:


------------------------------------------------------------------------------
# Genetic Code Table
#
# Obtained from: http://www.ncbi.nlm.nih.gov/collab/FT/genetic_codes.html
# and: http://www3.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
#
# Differs from Genetic Code [1] only in that the initiation sites have been
# changed to only 'AUG'

Genetic Code [0]
Standard
 
AAs  =   FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = -----------------------------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
------------------------------------------------------------------------------
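Because the five data lines are aligned column by column, a codon table can be recovered by reading them in parallel. The sketch below does this for text in the format shown above. The function name parse_genetic_code is invented for the example, and it assumes the labels 'AAs', 'Starts', 'Base1', 'Base2' and 'Base3' appear exactly as in the example file; the sample text is built in code to mirror the standard table.

```python
def parse_genetic_code(text):
    """Return (code_number, description, codon_to_aa, start_codons)."""
    rows = {}
    number = None
    description = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or set(line) == {"-"}:
            continue                      # skip blanks, comments and rule lines
        if line.startswith("Genetic Code"):
            number = int(line[line.index("[") + 1:line.index("]")])
        elif "=" in line:
            label, value = line.split("=", 1)
            rows[label.strip()] = value.strip()
        elif description is None:
            description = line            # first free-text line is the description
    # Column i of Base1/Base2/Base3 spells codon i; AAs and Starts annotate it.
    codons = ["".join(b) for b in zip(rows["Base1"], rows["Base2"], rows["Base3"])]
    codon_to_aa = dict(zip(codons, rows["AAs"]))
    starts = {c for c, flag in zip(codons, rows["Starts"]) if flag == "M"}
    return number, description, codon_to_aa, starts

AAS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
       "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
sample = "\n".join([
    "# Genetic Code Table",
    "Genetic Code [0]",
    "Standard",
    "",
    "AAs  =   " + AAS,
    "Starts = " + "-" * 35 + "M" + "-" * 28,
    "Base1  = " + "T" * 16 + "C" * 16 + "A" * 16 + "G" * 16,
    "Base2  = " + "TTTTCCCCAAAAGGGG" * 4,
    "Base3  = " + "TCAG" * 16,
])
number, description, codon_to_aa, starts = parse_genetic_code(sample)
```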

Notes

References

Warnings

When translating using a non-standard genetic code table, always check the table carefully for deviations from your particular organism's code.

Diagnostic Error Messages

Exit status

It exits with status 0.

Known bugs

See also

Program name   Description
dreg           regular expression search of a nucleotide sequence
fuzznuc        Nucleic acid pattern search
fuzzpro        Protein pattern search
patmatdb       Search a protein sequence with a motif
patmatmotifs   Search a PROSITE motif database with a protein sequence
preg           regular expression search of a protein sequence
pscan          Scans proteins using PRINTS
tfscan         Scans DNA sequences for transcription factors

Other EMBOSS programs, such as dreg and preg, allow you to search for regular expression patterns, but they may be harder to use for anyone who has never used regular expressions before.

Author(s)

This application was written by Alan Bleasby (ableasby@hgmp.mrc.ac.uk)

History

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments