EMBOSS: matcher


Program matcher

Function

Finds the best local alignments between two sequences

Description

Compares two sequences looking for local sequence similarities.

Matcher is based on Bill Pearson's 'lalign' application, version 2.0u4 (Feb. 1996).

Lalign uses code developed by X. Huang and W. Miller (Adv. Appl. Math. (1991) 12:337-357) for the "sim" program.

Matcher reports a specified number of alignments between the two sequences and shows each local alignment in full.

Usage

Here is a sample session with matcher.

 % matcher sw:hba_human sw:hbb_human
Output file [hba_human.matcher]: 

  43.4% identity in 145 HBA_HUMAN overlap; score:  264

Command line arguments

   Mandatory qualifiers:
  [-sequencea]         sequence   Sequence USA
  [-sequenceb]         sequence   Sequence USA
  [-outfile]           outfile    Output file name

   Optional qualifiers:
   -datafile           matrix     Matrix file
   -alternatives       integer    This sets the number of alternative matches
                                  output. By default only the highest scoring
                                  alignment is shown. A value of 2 gives you
                                  other reasonable alignments. In some cases,
                                  for example multidomain proteins or cDNA and
                                  genomic DNA comparisons, there may be other
                                  interesting and significant alignments.
   -gappenalty         integer    The gap penalty is the score taken away when
                                  a gap is created. The best value depends on
                                  the choice of comparison matrix. The
                                  default value of 14 assumes you are using
                                  the EBLOSUM62 matrix for protein sequences,
                                  or a value of 16 and the EDNAMAT matrix for
                                  nucleotide sequences.
   -gaplength          integer    The gap length, or gap extension, penalty is
                                  added to the standard gap penalty for each
                                  base or residue in the gap. This is how long
                                  gaps are penalized. Usually you will expect
                                  a few long gaps rather than many short
                                  gaps, so the gap extension penalty should be
                                  lower than the gap penalty. An exception is
                                  where one or both sequences are single
                                  reads with possible sequencing errors in
                                  which case you would expect many single base
                                  gaps. You can get this result by setting
                                  the gap penalty to zero (or very low) and
                                  using the gap extension penalty to control
                                  gap scoring.
   -markx              integer    This sets the alternate display of matches
                                  and mismatches in alignments.
                                  -markx=0 uses ':','.',' ', for identities,
                                  conservative replacements, and
                                  non-conservative replacements, respectively.
                                  -markx=1 uses ' ','x', and 'X'.
                                  -markx=2 does not show the second sequence,
                                  but uses the second alignment line to
                                  display matches with a '.' for identity, or
                                  with the mismatched residue for mismatches.
                                  -markx=3 outputs a title line with the
                                  percentage identity and score and then
                                  outputs the gapped sequences in multiple
                                  FASTA format.
                                  -markx=4 outputs only the title line with
                                  the percentage identity and score.
                                  -markx=5,6,7,8 and 9 are the same as
                                  -markx=1
                                  -markx=10 outputs a parseable output.
   -length             integer    number of residues per line

   Advanced qualifiers: (none)
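
The qualifiers listed above can also be assembled into a command line from a script. The following Python sketch is illustrative only: the helper name build_matcher_command is hypothetical, and it assumes that optional qualifiers are given as "-name value" pairs after the three positional parameters, as in the "-markx 1" sample sessions in this document.

```python
# Optional qualifiers documented for matcher in the listing above.
OPTIONAL_QUALIFIERS = {
    "datafile", "alternatives", "gappenalty", "gaplength", "markx", "length",
}


def build_matcher_command(seq_a, seq_b, outfile, **options):
    """Return an argument list for matcher (hypothetical helper).

    Only the qualifier names are checked; the values are passed through
    unchanged, so range checks are left to matcher itself.
    """
    command = ["matcher", seq_a, seq_b, outfile]
    for name, value in options.items():
        if name not in OPTIONAL_QUALIFIERS:
            raise ValueError(f"unknown matcher qualifier: -{name}")
        command += [f"-{name}", str(value)]
    return command


print(build_matcher_command("sw:hba_human", "sw:hbb_human", "stdout", markx=1))
```

On a system where matcher is installed, the returned list could be handed to subprocess.run(command, check=True).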

Mandatory qualifiers:

  [-sequencea] (Parameter 1)
      Description:    Sequence USA
      Allowed values: Readable sequence
      Default:        Required

  [-sequenceb] (Parameter 2)
      Description:    Sequence USA
      Allowed values: Readable sequence
      Default:        Required

  [-outfile] (Parameter 3)
      Description:    Output file name
      Allowed values: Output file
      Default:        <sequence>.matcher

Optional qualifiers:

  -datafile
      Description:    Matrix file
      Allowed values: Comparison matrix file in EMBOSS data path
      Default:        EBLOSUM62 for protein, EDNAMAT for DNA

  -alternatives
      Description:    This sets the number of alternative matches output.
                      By default only the highest scoring alignment is
                      shown. A value of 2 gives you other reasonable
                      alignments. In some cases, for example multidomain
                      proteins or cDNA and genomic DNA comparisons, there
                      may be other interesting and significant alignments.
      Allowed values: Integer 1 or more
      Default:        1

  -gappenalty
      Description:    The gap penalty is the score taken away when a gap is
                      created. The best value depends on the choice of
                      comparison matrix. The default value of 14 assumes you
                      are using the EBLOSUM62 matrix for protein sequences,
                      or a value of 16 and the EDNAMAT matrix for nucleotide
                      sequences.
      Allowed values: Positive integer
      Default:        14 for protein, 16 for nucleic

  -gaplength
      Description:    The gap length, or gap extension, penalty is added to
                      the standard gap penalty for each base or residue in
                      the gap. This is how long gaps are penalized. Usually
                      you will expect a few long gaps rather than many short
                      gaps, so the gap extension penalty should be lower
                      than the gap penalty. An exception is where one or
                      both sequences are single reads with possible
                      sequencing errors in which case you would expect many
                      single base gaps. You can get this result by setting
                      the gap penalty to zero (or very low) and using the
                      gap extension penalty to control gap scoring.
      Allowed values: Positive integer
      Default:        4 for any sequence

  -markx
      Description:    This sets the alternate display of matches and
                      mismatches in alignments.
                      -markx=0 uses ':','.',' ', for identities,
                      conservative replacements, and non-conservative
                      replacements, respectively.
                      -markx=1 uses ' ','x', and 'X'.
                      -markx=2 does not show the second sequence, but uses
                      the second alignment line to display matches with a
                      '.' for identity, or with the mismatched residue for
                      mismatches.
                      -markx=3 outputs a title line with the percentage
                      identity and score and then outputs the gapped
                      sequences in multiple FASTA format.
                      -markx=4 outputs only the title line with the
                      percentage identity and score.
                      -markx=5,6,7,8 and 9 are the same as -markx=1
                      -markx=10 outputs a parseable output.
      Allowed values: Integer up to 10
      Default:        0

  -length
      Description:    number of residues per line
      Allowed values: Integer from 1 to 200
      Default:        60

Advanced qualifiers:

  (none)
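
The interaction between -gappenalty and -gaplength is easier to see with numbers. The sketch below reads the descriptions above literally: a gap of k residues costs the gap penalty once plus the gap length penalty for each of the k residues. This formula is an interpretation of the text, not taken from matcher's source code, and the function name gap_cost is hypothetical.

```python
def gap_cost(residues_in_gap, gappenalty=14, gaplength=4):
    """Score removed by one gap, assuming cost = gappenalty + k * gaplength.

    The defaults are the documented protein defaults (14 and 4).
    """
    if residues_in_gap < 1:
        raise ValueError("a gap contains at least one residue")
    return gappenalty + gaplength * residues_in_gap


# One gap of five residues versus five single-residue gaps.
one_long_gap = gap_cost(5)
five_short_gaps = sum(gap_cost(1) for _ in range(5))
print(one_long_gap, five_short_gaps)  # 34 90

# With the gap penalty at zero, only the total number of gapped residues matters.
print(gap_cost(5, gappenalty=0), sum(gap_cost(1, gappenalty=0) for _ in range(5)))  # 20 20
```

Under this reading, the default settings favour a few long gaps, while a zero gap penalty treats many single-base gaps no worse than one long gap, which is the behaviour the -gaplength description recommends for single reads.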

Input file format

Any two sequence USAs of the same type (DNA or protein).

Output file format

The output from matcher is a sequence alignment.

There are several ways to change the output format using -markx.

Here is the output for the example:



              10        20        30        40         50          
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
       :.: .:. : : ::::  .. : :.::: :... .: :. .:  : :::      :. .:
HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
             10          20        30        40        50        60

          60        70        80        90       100       110     
HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
       :.:::::  :.....::.:.. .....::.::. ::.::: ::.::.. :. .:: :.  
HBB_HU KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
               70        80        90       100       110       120

         120       130       140
HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
       :::: :.:. .: .:.:...:. ::
HBB_HU EFTPPVQAAYQKVVAGVANALAHKY
              130       140     

Here are the example outputs using other values of -markx:


% matcher sw:hba_human sw:hbb_human stdout -markx 1
Finds the best local alignments between two sequences

  43.4% identity in 145 HBA_HUMAN overlap; score:  264

              10        20        30        40         50          
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
        x Xx xX X X      xxX X x   X xxxXx X xXx XX     X      xXx 
HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
             10          20        30        40        50        60

          60        70        80        90       100       110     
HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
        x     XX xxxxx  x xxXxxxxx  x  xX  x   X  x  xxX xXx  X xXX
HBB_HU KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
               70        80        90       100       110       120

         120       130       140
HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
           X x xXx Xx x xxx xX  
HBB_HU EFTPPVQAAYQKVVAGVANALAHKY
              130       140     



% matcher sw:hba_human sw:hbb_human stdout -markx 2
Finds the best local alignments between two sequences

  43.4% identity in 145 HBA_HUMAN overlap; score:  264

              10        20        30        40         50          
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
HBB_HU .T.EE.SA.T.L....--NVD.V.G...G.LLVVY.W.QRF.ES.G...TPDAVM.NPK.

          60        70        80        90       100       110     
HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
HBB_HU .A.....LG.FSDGL..L.NLKGTFAT..E..CD..H...E..R..GNV.VCV..H.FGK

         120       130       140
HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
HBB_HU ....P.Q.AYQ.VV.G.ANA.AH..
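
The -markx=2 rule (a '.' for identity, otherwise the residue from the second sequence) can be reproduced from two gapped sequences. This is a minimal Python sketch of the display rule as described in this document, not matcher's own code; the function name markx2_line is hypothetical, and gap characters are assumed to be copied through unchanged, as they appear in the example above.

```python
def markx2_line(first, second):
    """Build the -markx=2 style second line for two gapped sequences."""
    if len(first) != len(second):
        raise ValueError("gapped sequences must have the same length")
    return "".join(
        "." if a == b and a != "-" else b
        for a, b in zip(first, second)
    )


# Last block of the HBA_HUMAN / HBB_HUMAN example.
hba = "EFTPAVHASLDKFLASVSTVLTSKY"
hbb = "EFTPPVQAAYQKVVAGVANALAHKY"
print(markx2_line(hba, hbb))  # ....P.Q.AYQ.VV.G.ANA.AH..
```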



% matcher sw:hba_human sw:hbb_human stdout -markx 3
Finds the best local alignments between two sequences

  43.4% identity in 145 HBA_HUMAN overlap; score:  264
>HBA_HUMAN ..
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
>HBB_HUMAN ..
LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
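
Because -markx=3 emits the gapped sequences in multiple FASTA format, the figures on the title line can be rechecked from the sequences alone. The Python sketch below assumes that percentage identity is the number of identical columns divided by the total number of alignment columns, gaps included; that assumption reproduces the title line of this example but is not taken from matcher's source. The function names are hypothetical.

```python
def read_gapped_fasta(text):
    """Return {name: gapped sequence} for '>'-headed records."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records


def identity_stats(first, second):
    """Return (identical columns, total columns, percent identity)."""
    if len(first) != len(second):
        raise ValueError("gapped sequences must have the same length")
    identical = sum(1 for a, b in zip(first, second) if a == b and a != "-")
    return identical, len(first), 100.0 * identical / len(first)


EXAMPLE = """\
>HBA_HUMAN ..
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
>HBB_HUMAN ..
LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
"""

sequences = read_gapped_fasta(EXAMPLE)
identical, overlap, percent = identity_stats(
    sequences["HBA_HUMAN"], sequences["HBB_HUMAN"]
)
print(f"{percent:.1f}% identity in {overlap} overlap")  # 43.4% identity in 145 overlap
```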



% matcher sw:hba_human sw:hbb_human stdout -markx 4
Finds the best local alignments between two sequences

  43.4% identity in 145 HBA_HUMAN overlap; score:  264
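
The title line is the only output of -markx=4, so a script that needs just the identity and score can read it with a regular expression. This Python sketch assumes the title line always has the layout shown in the examples in this document; the pattern and the function name parse_title are hypothetical and have not been checked against other matcher versions.

```python
import re

# Layout assumed from the example: "<pct>% identity in <n> <name> overlap; score: <s>"
TITLE_PATTERN = re.compile(
    r"^\s*(?P<identity>\d+(?:\.\d+)?)% identity in (?P<overlap>\d+) "
    r"(?P<name>\S+) overlap; score:\s*(?P<score>-?\d+)"
)


def parse_title(line):
    """Return the title-line fields as a dict, or None if the line differs."""
    match = TITLE_PATTERN.match(line)
    if match is None:
        return None
    return {
        "identity": float(match.group("identity")),
        "overlap": int(match.group("overlap")),
        "name": match.group("name"),
        "score": int(match.group("score")),
    }


print(parse_title("  43.4% identity in 145 HBA_HUMAN overlap; score:  264"))
```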



% matcher sw:hba_human sw:hbb_human stdout -markx 10
Finds the best local alignments between two sequences
>>#1
; sw_score: 264
; sw_ident: 0.434
; sw_overlap: 145
>HBA_HUMAN ..
; sq_len: -115
; al_start: 2
; al_stop: 140
; al_display_start: 2
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
>HBB_HUMAN ..
; sq_len: 5
; al_start: 3
; al_stop: 145
; al_display_start: 3
LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY

Note that the parseable output starts each alignment record with ">>" while each aligned sequence record starts with ">".

All parameters produced will be of the form: ; xx_yyyyy

In this version, we have xx:

  sw - Smith-Waterman scores
  sq - sequence length, type
  al - alignment start, stop, display_offset

All of the output parameters correspond to values that are presented in other output formats, with the exception of the "al_" parameters.

al_start gives the location of the alignment start in the original sequence

al_stop gives the location of the end of the alignment in the original sequence

al_display_start gives the location of the first displayed amino acid residue in the original sequence.

The -markx=10 alignments are the same as those produced in the other modes. If the beginning of the first sequence aligns with the 10th residue of the second sequence, then the first sequence will be padded with ten leading "-" to produce the alignment. The leading "-" are a formatting convenience only; they are not considered in the numbering system for al_display_start, al_start, or al_stop.
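
The record markers and "; xx_yyyyy" parameters described above are enough to read -markx=10 output with a short script. The following Python sketch is one possible reader, not an official parser: the function name parse_markx10 and the dictionary layout are hypothetical, and all parameter values are kept as strings exactly as printed.

```python
def parse_markx10(text):
    """Parse -markx=10 text into a list of alignment records."""
    alignments = []
    current = None   # alignment record currently being filled
    target = None    # dict that receives the next "; key: value" lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>"):
            # ">>" starts a new alignment record.
            current = {"id": line[2:], "params": {}, "sequences": []}
            alignments.append(current)
            target = current["params"]
        elif line.startswith(">") and current is not None:
            # ">" starts an aligned sequence record inside the alignment.
            entry = {"name": line[1:].split()[0], "params": {}, "residues": ""}
            current["sequences"].append(entry)
            target = entry["params"]
        elif line.startswith(";") and target is not None:
            key, _, value = line[1:].partition(":")
            target[key.strip()] = value.strip()
        elif current is not None and current["sequences"]:
            # Any other line inside a sequence record is gapped sequence text.
            current["sequences"][-1]["residues"] += line
    return alignments


SAMPLE = """\
Finds the best local alignments between two sequences
>>#1
; sw_score: 264
; sw_ident: 0.434
; sw_overlap: 145
>HBA_HUMAN ..
; sq_len: -115
; al_start: 2
; al_stop: 140
; al_display_start: 2
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
>HBB_HUMAN ..
; sq_len: 5
; al_start: 3
; al_stop: 145
; al_display_start: 3
LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
"""

for record in parse_markx10(SAMPLE):
    print(record["id"], record["params"]["sw_score"])
    for sequence in record["sequences"]:
        print(sequence["name"], sequence["params"]["al_start"],
              sequence["params"]["al_stop"])
```

Lines before the first ">>" marker, such as the program description, are ignored by this sketch.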

Data files

For protein sequences, EBLOSUM62 is used as the substitution matrix. For nucleotide sequences, EDNAMAT is used.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory or, again, in a subdirectory called ".embossdata".

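The directories are searched in the following order:

  1. The current directory
  2. The ".embossdata" subdirectory of the current directory
  3. The user's home directory
  4. The ".embossdata" subdirectory of the user's home directory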

Notes

References

  1. Huang, X. and Miller, W. (1991) Adv. Appl. Math. 12:337-357.
  2. Needleman, S. B. and Wunsch, C. D. (1970) J. Mol. Biol. 48:443-453.

Warnings

Diagnostic Error Messages

Exit status

0 upon successful completion.

Known bugs

See also

Program name   Description
water          Smith-Waterman local alignment

Author(s)

This program was originally written by Bill Pearson as part of the FASTA package under the name 'lalign'.

This application was modified for inclusion in EMBOSS by Ian Longden (il@sanger.ac.uk) Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

History

 Completed 11th May 1999.
 Last modified 19th July 1999.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments