EMBOSS: supermatcher


Program supermatcher

Function

Finds a match of a large sequence against one or more sequences

Description

This is a rough-and-ready local alignment program for large sequences. It first uses wordmatch to find all the word matches between the first sequence and each of the other sequences. It then finds the highest-scoring diagonal and uses it as the centre of a Smith-Waterman type calculation, restricted to a band whose width is given by the user. Because only this narrow band around the diagonal is calculated, the results are approximate, but the saving in space allows much larger sequences to be aligned.
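The two stages described above can be illustrated with a minimal Python sketch. It is not the supermatcher source code. The function names best_diagonal and banded_sw_score are hypothetical, the diagonal is scored simply by counting exact word hits, a plain match/mismatch score stands in for the comparison matrix, and a single linear gap penalty replaces the separate gap opening and extension penalties. All of these are simplifying assumptions.

```python
from collections import defaultdict


def best_diagonal(a, b, wordlen=6):
    """Return the diagonal (i - j) holding the most exact word matches.

    Returns None when the sequences share no word of length wordlen.
    Counting hits per diagonal is an assumed stand-in for the diagonal
    score that the program calculates.
    """
    index = defaultdict(list)
    for j in range(len(b) - wordlen + 1):
        index[b[j:j + wordlen]].append(j)
    counts = defaultdict(int)
    for i in range(len(a) - wordlen + 1):
        for j in index.get(a[i:i + wordlen], ()):
            counts[i - j] += 1
    if not counts:
        return None
    return max(counts, key=counts.get)


def banded_sw_score(a, b, diag, width=16, match=5.0, mismatch=-4.0, gap=10.0):
    """Best local score using only cells with |(i - j) - diag| <= width.

    Cells outside the band are treated as score 0, which in a local
    alignment is the same as starting a new alignment there.
    """
    best = 0.0
    prev = {}  # scores of the previous row, keyed by column j
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i - diag - width)
        hi = min(len(b), i - diag + width)
        for j in range(lo, hi + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0.0,
                        prev.get(j - 1, 0.0) + sub,   # align a[i-1] with b[j-1]
                        prev.get(j, 0.0) - gap,       # gap in b
                        cur.get(j - 1, 0.0) - gap)    # gap in a
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


if __name__ == "__main__":
    seq_a = "TTTTACGTAGCTAGGA"
    seq_b = "CCACGTAGCTAGGACC"
    d = best_diagonal(seq_a, seq_b, wordlen=6)
    if d is None:
        print("no word matches, so no start point for the band")
    else:
        print("diagonal", d, "score", banded_sw_score(seq_a, seq_b, d, width=2))
```

Only two rows of the band are held at any time in this sketch because it reports a score and no alignment. Recovering the alignment itself would require keeping the whole band for traceback.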

Usage

Here is a sample session with supermatcher.

 supermatcher ~/wordtest/U68037 ~/wordtest/AB003171 -noscoreonly

Finds a match of a large sequence against one or more sequences
Gap opening penalty [10.0]: 
Gap extension penalty [0.5]: 3.0
Output file [stdout]: 
Local: RNU68037 vs EM:AB003171
Score: 30.00

RNU68037        820   tcaaccacagctgccctccgcagctctcggggag.gcggc.tccg 862  
                      ||||| ||||   || |  ||||     |||||| ||  | ||| 
EM:AB003171     2492  tcaactacag.aaccatgtgcag....aggggagagctccatcct 2531 

RNU68037        863   cgcgcagggttcacgcacacga.cgtgg.aaatggtgggccagct 905  
                       |  ||  ||| | || || || | ||| ||   ||||   |  |
EM:AB003171     2532  tgaaca..gttaaagc.ca.gagcttggtaacaagtggataaatt 2572 

RNU68037        906   .cgtgggcatcatggtggtgtc.gtg..catctgctggagc     942  
                       | |    |||||  ||  ||| | |  ||  |||  ||||
EM:AB003171     2573  acat....atcattttgcggtctgagaacacatgc.agagc      2608 


Command line arguments

   Mandatory qualifiers:
  [-seqa]              seqall     Sequence database USA
  [-seqb]              seqset     Sequence set USA
   -gapopen            float      Gap opening penalty
   -gapextend          float      Gap extension penalty
   -outfile            align      (no help text) align value

   Optional qualifiers:
   -datafile           matrixf    Matrix file
   -width              integer    Alignment width
   -wordlen            integer    word length for initial matching
   -errorfile          outfile    Error file to be written to

   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers                                     Allowed values                              Default
[-seqa] (Parameter 1)  Sequence database USA             Readable sequence(s)                        Required
[-seqb] (Parameter 2)  Sequence set USA                  Readable sequences                          Required
-gapopen               Gap opening penalty               Number from 1.000 to 100.000                10.0 for any sequence type
-gapextend             Gap extension penalty             Number from 0.100 to 10.000                 0.5 for any sequence type
-outfile               (no help text) align value        Alignment file

Optional qualifiers                                      Allowed values                              Default
-datafile              Matrix file                       Comparison matrix file in EMBOSS data path  EBLOSUM62 for protein, EDNAFULL for DNA
-width                 Alignment width                   Any integer value                           16
-wordlen               word length for initial matching  Integer 3 or more                           6
-errorfile             Error file to be written to       Output file                                 supermatcher.error

Advanced qualifiers                                      Allowed values                              Default
(none)
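
For scripted use, the qualifiers and ranges listed above can be checked before the program is started. The following Python sketch builds an argument list from them. The helper name supermatcher_args is hypothetical, and passing the two sequences as positional parameters follows the sample session in the Usage section.

```python
def supermatcher_args(seqa, seqb, gapopen=10.0, gapextend=0.5,
                      width=16, wordlen=6, outfile="stdout"):
    """Build an argument list for supermatcher from the documented qualifiers.

    Ranges and defaults are taken from the qualifier table; values
    outside them are rejected before the program is ever run.
    """
    if not 1.0 <= gapopen <= 100.0:
        raise ValueError("gapopen must be between 1.0 and 100.0")
    if not 0.1 <= gapextend <= 10.0:
        raise ValueError("gapextend must be between 0.1 and 10.0")
    if wordlen < 3:
        raise ValueError("wordlen must be 3 or more")
    return [
        "supermatcher", seqa, seqb,
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-width", str(width),
        "-wordlen", str(wordlen),
        "-outfile", outfile,
    ]


if __name__ == "__main__":
    print(" ".join(supermatcher_args("U68037", "AB003171", gapextend=3.0)))
```

The resulting list can be handed to subprocess.run on a system where EMBOSS is installed.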

Input file format

Two sequence USAs.

Output file format

supermatcher.error will contain any errors that occurred while the program was running. For example, wordmatch may have found no matches, so that no suitable start point was available for the Smith-Waterman calculation.

Data files

For protein sequences, EBLOSUM62 is used as the substitution matrix. For nucleotide sequences, EDNAFULL is used. Other matrices can be specified with -datafile.

Notes

The time this program takes to do an alignment depends very much on the word size. For short sequences a short word size (e.g. 4) can make it take a very long time. Large word sizes (e.g. 30) for sequences that are very similar give a very quick result. The default word length of 6 should give reasonably fast alignments.
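
One way to see why short words are costly is to estimate how many word matches arise purely by chance. The Python sketch below assumes two unrelated sequences with uniformly random residues, which real sequences are not, so it indicates only the trend and says nothing exact about supermatcher's running time. The function name is hypothetical.

```python
def expected_chance_word_matches(len_a, len_b, wordlen, alphabet=4):
    """Expected number of exact word matches between two random sequences.

    Assumes every residue is drawn uniformly and independently from an
    alphabet of the given size (4 for DNA).
    """
    positions_a = max(0, len_a - wordlen + 1)
    positions_b = max(0, len_b - wordlen + 1)
    return positions_a * positions_b / alphabet ** wordlen


if __name__ == "__main__":
    for k in (4, 6, 16, 30):
        print(k, expected_chance_word_matches(100000, 100000, k))
```

Each extra base in the word divides the expected number of chance matches by four under this model, so fewer spurious hits have to be processed as the word size grows.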

Because it does a Smith & Waterman alignment (albeit in a narrow region around the diagonal shown to be the 'best' by a word match), this program can use huge amounts of memory if the sequences are large.

Because the alignment is made within a narrow area each side of the 'best' diagonal, if there are sufficient indels between the two sequences, then the path of the Smith & Waterman alignment can wander outside of this area. Making the width larger can avoid this problem, but you then use more memory.

The longer the sequences and the wider the specified alignment width, the more memory will be used.
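
As a rough guide to how length and width interact, the number of dynamic-programming cells in a band can be compared with that of a full matrix. The Python sketch below counts cells only. How supermatcher actually lays out its memory is not described here, so the function name and the assumption of one band row per residue of the first sequence are illustrative.

```python
def dp_cells(len_a, len_b, width=None):
    """Count dynamic-programming cells for a full or banded calculation.

    With width=None the full len_a x len_b matrix is counted. Otherwise
    each row holds at most 2 * width + 1 cells, centred on the chosen
    diagonal.
    """
    if width is None:
        return len_a * len_b
    return len_a * min(len_b, 2 * width + 1)


if __name__ == "__main__":
    n = 100000
    print("full matrix:", dp_cells(n, n))
    print("band, width 16:", dp_cells(n, n, width=16))
    print("band, width 160:", dp_cells(n, n, width=160))
```

In this count the band grows linearly with both the sequence length and the width, which is why raising -width on very long sequences increases memory use.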

If the program terminates due to lack of memory you can try the following:

Run the UNIX command 'limit' (csh or tcsh; use 'ulimit' in sh or bash) to see whether your stack size or memory usage has been limited and, if so, run 'unlimit' (e.g. '% unlimit stacksize').
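
The same limits can be inspected from a script before supermatcher is launched. The Python sketch below uses the standard resource module, which is available on UNIX systems only. The helper name describe_limits is hypothetical.

```python
import resource


def describe_limits():
    """Return {limit name: (soft, hard)}, with None meaning unlimited."""
    def clean(value):
        return None if value == resource.RLIM_INFINITY else value

    limits = {}
    for name in ("RLIMIT_STACK", "RLIMIT_DATA"):
        soft, hard = resource.getrlimit(getattr(resource, name))
        limits[name] = (clean(soft), clean(hard))
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in describe_limits().items():
        print(name, "soft:", soft, "hard:", hard)
```

A soft limit that is lower than the hard limit can be raised by the user. Values are reported in bytes.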

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with a status of 0.

Known bugs

None.

See also

Program name   Description
matcher        Finds the best local alignments between two sequences
seqmatchall    Does an all-against-all comparison of a set of sequences
water          Smith-Waterman local alignment
wordmatch      Finds all exact matches of a given size between 2 sequences

Author(s)

This application was written by Ian Longden (il@sanger.ac.uk) Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

History

Finished.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.
