EMBOSS: fuzznuc


Program fuzznuc

Function

Nucleic acid pattern search

Description

fuzznuc uses PROSITE style patterns to search nucleotide sequences.

Patterns are specifications of a (typically short) length of sequence to be found. They can specify a search for an exact sequence or they can allow various ambiguities, matches to variable lengths of sequence and repeated subsections of the sequence.

fuzznuc automatically selects the most suitable search algorithm, depending on the complexity of the search pattern specified.

Usage

Here is a sample session with fuzznuc.

% fuzznuc
Input sequence: embl:hhtetra
Search pattern: AAGCTT
Number of mismatches [0]: 
Output file [hhtetra.fuzznuc]:

Command line arguments

   Mandatory qualifiers:
  [-sequence]          seqall     Sequence database USA
   -pattern            string     Search pattern
   -mismatch           integer    Number of mismatches
  [-outf]              outfile    Output file name

   Optional qualifiers: (none)
   Advanced qualifiers:
   -mmshow             bool       Show mismatches
   -accshow            bool       Show accession numbers
   -descshow           bool       Show descriptions
   -usashow            bool       Showing the USA (Uniform Sequence Address)
                                  of the matching sequences will turn your
                                  output file into a 'list' file that can then
                                  be read in by many other EMBOSS programs by
                                  specifying it with a '@' in front of the
                                  filename.
   -complement         bool       Search complementary strand

   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers                             Allowed values          Default
[-sequence]      Sequence database USA           Readable sequence(s)    Required
(Parameter 1)
-pattern         Search pattern                  Any string is accepted  An empty string is accepted
-mismatch        Number of mismatches            Integer 0 or more       0
[-outf]          Output file name                Output file             <sequence>.fuzznuc
(Parameter 2)

Optional qualifiers                              Allowed values          Default
(none)

Advanced qualifiers                              Allowed values          Default
-mmshow          Show mismatches                 Yes/No                  No
-accshow         Show accession numbers          Yes/No                  No
-descshow        Show descriptions               Yes/No                  No
-usashow         Showing the USA (Uniform        Yes/No                  No
                 Sequence Address) of the
                 matching sequences will turn
                 your output file into a 'list'
                 file that can then be read in
                 by many other EMBOSS programs
                 by specifying it with a '@' in
                 front of the filename.
-complement      Search complementary strand     Yes/No                  No
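
The qualifiers listed above can also be supplied on the command line instead of at the prompts. The following is a minimal sketch, in Python, of assembling such a command line; the helper name fuzznuc_argv is hypothetical, and it is an assumption that every qualifier is spelled out in full and that boolean qualifiers are given bare, as in the sample sessions in this document.

```python
def fuzznuc_argv(sequence, pattern, outfile, mismatch=0,
                 mmshow=False, complement=False):
    """Build an argument list for fuzznuc (hypothetical helper)."""
    # Mandatory qualifiers, using the names from the qualifier table.
    argv = ["fuzznuc",
            "-sequence", sequence,
            "-pattern", pattern,
            "-mismatch", str(mismatch),
            "-outf", outfile]
    # Advanced boolean qualifiers are appended only when requested.
    if mmshow:
        argv.append("-mmshow")
    if complement:
        argv.append("-complement")
    return argv


print(" ".join(fuzznuc_argv("embl:hhtetra", "AAGCTT", "hhtetra.fuzznuc")))
```

On a machine where EMBOSS is installed, the resulting list could be handed to subprocess.run; the sketch itself only builds the list and does not run anything.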

Input file format

Patterns for fuzznuc are based on the format of pattern used in the PROSITE database, with the difference that the terminating dot '.' and the hyphens, '-', between the characters are optional.

The PROSITE pattern definition from the PROSITE documentation (amended to refer to nucleic acid sequences, not proteins) follows.

For example, in the EMBL entry ECLAC you can look for the pattern:


[CG](5)TG{A}N(1,5)C

This searches for "C or G" 5 times, followed by T and G, then anything except A, then any base (1 to 5 times) before a C.
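
To make the reading of this pattern concrete, here is a minimal sketch in Python that translates such a pattern into an ordinary regular expression. It is an illustration only, not how fuzznuc itself works; the function name prosite_to_regex is hypothetical, and the sketch assumes that only the elements used in the example ([], {}, (), N, optional hyphens and an optional terminating dot) occur.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style nucleotide pattern into a regex string."""
    # Hyphens between elements and the terminating dot are optional: drop them.
    p = pattern.strip().rstrip(".").replace("-", "").upper()
    out = []
    i = 0
    while i < len(p):
        ch = p[i]
        if ch == "[":              # [CG]  : any one of the listed bases
            j = p.index("]", i)
            out.append("[" + p[i + 1:j] + "]")
            i = j + 1
        elif ch == "{":            # {A}   : anything except the listed bases
            j = p.index("}", i)
            out.append("[^" + p[i + 1:j] + "]")
            i = j + 1
        elif ch == "(":            # (5) or (1,5) : repeat the previous element
            j = p.index(")", i)
            out.append("{" + p[i + 1:j] + "}")
            i = j + 1
        elif ch == "N":            # N     : any base
            out.append("[ACGT]")
            i += 1
        else:                      # a literal base
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


regex = prosite_to_regex("[CG](5)TG{A}N(1,5)C")
print(regex)  # [CG]{5}TG[^A][ACGT]{1,5}C

# The search is case-independent, hence re.IGNORECASE.
hit = re.search(regex, "aacgcgctgtac", re.IGNORECASE)
print(hit.start() + 1, hit.group())  # 3 cgcgctgtac
```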

You can use ambiguity codes for nucleic acid searches, but not within [] or {}, because each code expands to its bracketed counterpart. For example, "S" is expanded to "[GC]", so [S] would be expanded to [[GC]], which is illegal.
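
The expansion rule can be sketched as follows in Python. The table holds the standard IUPAC nucleotide ambiguity codes; the function name expand_ambiguity is hypothetical, and the order of bases inside each bracket (other than the "[GC]" given above) is an assumption rather than something taken from fuzznuc.

```python
# Standard IUPAC nucleotide ambiguity codes (bracket ordering is an assumption).
IUPAC = {
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_ambiguity(pattern):
    """Expand ambiguity codes to bracketed sets; reject them inside [] or {}."""
    out = []
    depth = 0  # greater than 0 while inside [...] or {...}
    for ch in pattern.upper():
        if ch in "[{":
            depth += 1
        elif ch in "]}":
            depth -= 1
        if ch in IUPAC:
            if depth > 0:
                # Expanding here would produce nested brackets such as [[GC]].
                raise ValueError("ambiguity code %r used inside [] or {}" % ch)
            out.append("[" + IUPAC[ch] + "]")
        else:
            out.append(ch)
    return "".join(out)


print(expand_ambiguity("AAS"))  # AA[GC]
```

Calling expand_ambiguity("[S]") raises ValueError in this sketch, mirroring the illegal [[GC]] case described above.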

Note the use of X is reserved for proteins. You must use N for nucleic acids to refer to any base.

The search is case-independent, so 'AAA' matches 'aaa'.

Output file format

Here is the output from the example search:


        HHTETRA     1 AAGCTT
        HHTETRA  1267 AAGCTT

It is composed of three columns of data: the sequence name, the start position of the match, and the matching sequence.
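
For scripts that consume this output, a minimal parsing sketch in Python is given here. It assumes the default three-column, whitespace-separated layout shown above, with no extra columns switched on; the names Hit and parse_hits are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str    # sequence name
    start: int   # start position of the match
    match: str   # matching sequence


def parse_hits(text):
    """Parse default three-column fuzznuc output into a list of Hit records."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a three-column match line.
        if len(fields) == 3 and fields[1].isdigit():
            hits.append(Hit(fields[0], int(fields[1]), fields[2]))
    return hits


sample = """
        HHTETRA     1 AAGCTT
        HHTETRA  1267 AAGCTT
"""
for hit in parse_hits(sample):
    print(hit.name, hit.start, hit.match)
```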

If the option '-mmshow' is used, then an extra column of data is output indicating how many mismatches there are in the match, for example:

% fuzznuc embl:hhtetra -mmshow
Nucleic acid pattern search
Search pattern: AAGCTT
Number of mismatches [0]: 1
Output file [hhtetra.fuzznuc]: stdout


        HHTETRA       53     1 AAGCTG
        HHTETRA      140     1 AAGCAT
        HHTETRA      314     1 AACCTT
        HHTETRA      350     1 AAGCCT
        HHTETRA      374     1 AAGTTT
        HHTETRA     1009     1 AAGTTT
        HHTETRA     1259     1 AAGTTT
        HHTETRA     1267     0 AAGCTT
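
What the mismatch column counts can be illustrated with a brute-force sketch in Python. This is not the algorithm fuzznuc uses (the program chooses its own); it handles only plain patterns without brackets or repeats, and the function name fuzzy_find is hypothetical.

```python
def fuzzy_find(seq, pattern, max_mismatch):
    """Return (position, mismatches, matched text) for every window of seq
    that differs from pattern at no more than max_mismatch positions."""
    seq, pattern = seq.upper(), pattern.upper()   # case-independent search
    m = len(pattern)
    hits = []
    for i in range(len(seq) - m + 1):
        window = seq[i:i + m]
        # Count the positions at which the window and the pattern differ.
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatch:
            hits.append((i + 1, mismatches, window))  # 1-based position
    return hits


for pos, mm, text in fuzzy_find("ccAAGCTTggAAGCTGtt", "AAGCTT", 1):
    print(pos, mm, text)
# 3 0 AAGCTT
# 11 1 AAGCTG
```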

If the option '-descshow' (abbreviated to '-desc' in the example) is used, the description of the sequence is displayed on the line before each match. For example:

% fuzznuc embl:hhtetra -desc
Nucleic acid pattern search
Search pattern: AAGCTT
Number of mismatches [0]: 
Output file [hhtetra.fuzznuc]: stdout


Human herpesvirus 7 (clone ED132'1.2) telomeric repeat region.
        HHTETRA        1 AAGCTT
Human herpesvirus 7 (clone ED132'1.2) telomeric repeat region.
        HHTETRA     1267 AAGCTT

If the option '-accshow' (abbreviated to '-acc' in the example) is also used, the accession number of the sequence is also displayed, at the start of the line that precedes each match. For example:

% fuzznuc embl:hhtetra -desc -acc
Nucleic acid pattern search
Search pattern: AAGCTT
Number of mismatches [0]: 
Output file [hhtetra.fuzznuc]: stdout


L46634 Human herpesvirus 7 (clone ED132'1.2) telomeric repeat region.
        HHTETRA        1 AAGCTT
L46634 Human herpesvirus 7 (clone ED132'1.2) telomeric repeat region.
        HHTETRA     1267 AAGCTT

If the option '-usashow' (abbreviated to '-usa' in the example) is used, the Uniform Sequence Address (USA) is output at the start of each line of match details. For example:

% fuzznuc embl:hhtetra -usa
Nucleic acid pattern search
Search pattern: AAGCTT
Number of mismatches [0]: 
Output file [hhtetra.fuzznuc]: stdout


embl-id:HHTETRA         HHTETRA        1 AAGCTT
embl-id:HHTETRA         HHTETRA     1267 AAGCTT

This is useful because it turns the output into a 'list' file of sequence names that other EMBOSS programs can read when a '@' is put in front of the filename.
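
As a small illustration of why the leading column matters, the Python sketch below pulls the distinct USAs out of such output. It assumes the four-column layout shown above, with the USA as the first whitespace-separated field; the function name usas_from_output is hypothetical.

```python
def usas_from_output(text):
    """Collect the distinct USAs from '-usashow' output, in order of appearance."""
    usas = []
    for line in text.splitlines():
        fields = line.split()
        # A match line here has four fields: USA, name, position, match.
        if len(fields) == 4 and fields[0] not in usas:
            usas.append(fields[0])
    return usas


sample = """
embl-id:HHTETRA         HHTETRA        1 AAGCTT
embl-id:HHTETRA         HHTETRA     1267 AAGCTT
"""
print(usas_from_output(sample))  # ['embl-id:HHTETRA']
```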

If the option '-complement' (abbreviated to '-comp' in the example) is used, the search is also done on the reverse-sense strand. Matches on that strand are reported using forward-sense positions, so the position shown is actually the end of the match as read along the reverse strand. The matching sequence is given in square brackets to distinguish it from a forward-sense match. For example:

% fuzznuc embl:hhtetra -comp
Nucleic acid pattern search
Search pattern: AAGCTT
Number of mismatches [0]: 
Output file [hhtetra.fuzznuc]: stdout


        HHTETRA        1 AAGCTT
        HHTETRA     1267 AAGCTT
        HHTETRA        1 [AAGCTT]
        HHTETRA     1267 [AAGCTT]
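
The coordinate convention for reverse-strand matches can be sketched in Python as follows. This is an illustration under stated assumptions, not fuzznuc's implementation: it handles only plain patterns made of A, C, G and T, the function names are hypothetical, and showing the pattern itself inside the brackets is an assumption (the example above is palindromic, so it cannot distinguish the two strands).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a sequence of A, C, G and T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def search_both_strands(seq, pattern):
    """Find pattern on both strands, reporting forward-sense positions."""
    seq, pattern = seq.upper(), pattern.upper()
    n, m = len(seq), len(pattern)
    hits = []
    # Forward strand: report the 1-based start of each match.
    i = seq.find(pattern)
    while i != -1:
        hits.append((i + 1, pattern))
        i = seq.find(pattern, i + 1)
    # Reverse strand: search the reverse complement, then convert the index
    # back to forward-sense coordinates. The value reported is the lowest
    # forward position covered, i.e. the end of the match on the reverse strand.
    rc = reverse_complement(seq)
    reverse_hits = []
    i = rc.find(pattern)
    while i != -1:
        reverse_hits.append((n - i - m + 1, "[" + pattern + "]"))
        i = rc.find(pattern, i + 1)
    hits.extend(sorted(reverse_hits))
    return hits


print(search_both_strands("AAGCTTCCGGG", "AAGCTT"))
# [(1, 'AAGCTT'), (1, '[AAGCTT]')]
```

Because AAGCTT is its own reverse complement, every forward match is also a reverse match at the same position, which is why the example output above lists each position twice.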

Data files

None.

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with a status of 0.

Known bugs

None.

See also

Program name   Description
dreg           Regular expression search of a nucleotide sequence
fuzztran       Protein pattern search after translation
marscan        Finds MAR/SAR sites in nucleic sequences

Other EMBOSS programs allow you to search for regular expression patterns, but they may be harder to use for anyone who has never used regular expressions before.

Author(s)

This application was written by Alan Bleasby (ableasby@hgmp.mrc.ac.uk)

History

Written (2000) - Alan Bleasby
'-usa' added (13 March 2001) - Gary Williams

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments