EMBOSS: cons


Program cons

Function

Creates a consensus from multiple alignments

Description

cons calculates a consensus sequence from a multiple sequence alignment. To obtain the consensus, the sequence weights and a scoring matrix are used to calculate a score at each position in the alignment.

Each residue (or nucleotide) i in an alignment column is compared with every other residue j in that column. The score for i is the sum, over all residues j other than i itself, of score(i,j) * weight(j), where score(i,j) is taken from a nucleotide or protein scoring matrix (see the -datafile qualifier) and weight(j) is the weight of sequence j, as given in the alignment file.
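
The scoring step described above can be sketched in Python as follows. This is an illustration only, not the cons source code: the function names are hypothetical, and toy_score is a made-up stand-in (+1 for a match, -1 for a mismatch) for a real scoring matrix such as EDNAFULL or EBLOSUM62.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a scoring matrix: +1 match, -1 mismatch."""
    return 1.0 if a == b else -1.0


def column_scores(column, weights, score=toy_score):
    """Return one score per sequence for a single alignment column.

    The score for position i is the sum, over every j != i, of
    score(column[i], column[j]) * weights[j].
    """
    scores = []
    for i, res_i in enumerate(column):
        total = 0.0
        for j, res_j in enumerate(column):
            if i == j:
                continue  # a residue is not compared with itself
            total += score(res_i, res_j) * weights[j]
        scores.append(total)
    return scores


# One column from four equally weighted sequences.
print(column_scores(["a", "a", "a", "g"], [1.0, 1.0, 1.0, 1.0]))
```

With equal weights, each 'a' gains from the two other 'a' residues and loses from the 'g', while the 'g' loses against all three 'a' residues.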

The highest-scoring residue type in the column is then found. If the number of positive matches for this residue is greater than the "plurality" value, this residue becomes the consensus. The positive matches for a residue i are the sum of the weights of all the other residues in the column that increase the score of residue i (i.e. those whose matrix score against i is positive).

Where no consensus is found at a position, an 'n' (for a DNA sequence) or an 'x' (for a protein sequence) is output.

The "plurality" qualifier allows the user to set a cut-off for the number of positive matches below which there is no consensus.
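
A minimal Python sketch of the plurality test for one column is given below. It is not taken from the cons source: the function names and the toy +1/-1 scoring function are assumptions made for illustration, gap characters are not handled, and a positive-match total exactly equal to the cut-off is treated as passing (the description does not settle that boundary case).

```python
def toy_score(a, b):
    """Hypothetical stand-in for a scoring matrix: +1 match, -1 mismatch."""
    return 1.0 if a == b else -1.0


def consensus_residue(column, weights, plurality=None, protein=False,
                      score=toy_score):
    """Return the consensus character for one column, or 'n'/'x' if none."""
    n = len(column)
    if plurality is None:
        # Documented default: half the total weight of all sequences.
        plurality = sum(weights) / 2.0

    # Score each residue against every other residue in the column.
    scores = [
        sum(score(column[i], column[j]) * weights[j]
            for j in range(n) if j != i)
        for i in range(n)
    ]
    best = max(range(n), key=lambda i: scores[i])

    # Positive matches: weights of the other residues scoring > 0 against it.
    positive = sum(
        weights[j] for j in range(n)
        if j != best and score(column[best], column[j]) > 0
    )

    if positive >= plurality:  # boundary handling is an assumption
        return column[best]
    return "x" if protein else "n"


print(consensus_residue(["a", "a", "a", "g"], [1.0] * 4))
print(consensus_residue(["a", "c", "g", "t"], [1.0] * 4))
```

Raising the plurality argument makes the test stricter; in this sketch the first column above no longer yields a consensus once plurality is set to 3.0.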

The "identity" qualifier sets the number of identities required at a site for it to give a consensus at that position. If it is set to the number of sequences in the alignment, only columns of identities contribute to the consensus.

The "setcase" qualifier sets the threshold for the positive matches above which the consensus is in upper-case and below which the consensus is in lower-case.
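
The effect of the "identity" and "setcase" qualifiers on a single consensus call can be sketched as below. This follows the wording above literally and is not the cons implementation: the function name and arguments are hypothetical, and a positive-match total exactly equal to the setcase threshold is treated as lower-case, which is an assumption.

```python
def apply_identity_and_case(column, residue, positive, identity=0,
                            setcase=0.0, no_consensus="n"):
    """Post-process one consensus call.

    column   -- residues found in the alignment column
    residue  -- residue already chosen as the consensus for this column
    positive -- summed weight of that residue's positive matches
    """
    # -identity: require at least `identity` sequences to carry the residue.
    identical = sum(1 for r in column if r.lower() == residue.lower())
    if identity > 0 and identical < identity:
        return no_consensus

    # -setcase: upper-case above the threshold, lower-case otherwise
    # (treating equality as lower-case is an assumption).
    return residue.upper() if positive > setcase else residue.lower()


column = ["a", "a", "a", "g"]
print(apply_identity_and_case(column, "a", 2.0))               # default settings
print(apply_identity_and_case(column, "a", 2.0, identity=4))   # needs 4 identities
print(apply_identity_and_case(column, "a", 2.0, setcase=2.5))  # below threshold
```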

Usage

Here is a sample session with cons:

% cons
Creates a consensus from multiple alignments
Input sequence set: aligned.fasta
Output file [outfile.cons]: aligned.cons

Command line arguments

   Mandatory qualifiers:
  [-msf]               seqset     File containing a sequence alignment.
  [-outseq]            seqout     Output sequence USA

   Optional qualifiers:
   -datafile           matrix     Scoring matrix
   -plurality          float      Set a cut-off for the number of positive
                                  matches below which there is no consensus.
                                  The default plurality is taken as half the
                                  total weight of all the sequences in the
                                  alignment.
   -setcase            float      Sets the threshold for the positive matches
                                  above which the consensus is in upper-case
                                  and below which the consensus is in
                                  lower-case.
   -identity           integer    Provides the facility of setting the
                                  required number of identities at a site for
                                  it to give a consensus at that position.
                                  Therefore, if this is set to the number of
                                  sequences in the alignment only columns of
                                  identities contribute to the consensus.
   -name               string     Name of the consensus sequence

   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers

  [-msf] (Parameter 1)
      Description:     File containing a sequence alignment.
      Allowed values:  Readable sequences
      Default:         Required

  [-outseq] (Parameter 2)
      Description:     Output sequence USA
      Allowed values:  Writeable sequence
      Default:         <sequence>.format

Optional qualifiers

  -datafile
      Description:     Scoring matrix
      Allowed values:  Comparison matrix file in EMBOSS data path
      Default:         EBLOSUM62 for protein, EDNAFULL for DNA

  -plurality
      Description:     Set a cut-off for the number of positive matches
                       below which there is no consensus. The default
                       plurality is taken as half the total weight of all
                       the sequences in the alignment.
      Allowed values:  Any numeric value
      Default:         Half the total sequence weighting

  -setcase
      Description:     Sets the threshold for the positive matches above
                       which the consensus is in upper-case and below
                       which the consensus is in lower-case.
      Allowed values:  Any numeric value
      Default:         0

  -identity
      Description:     Sets the number of identities required at a site
                       for it to give a consensus at that position. If
                       this is set to the number of sequences in the
                       alignment, only columns of identities contribute
                       to the consensus.
      Allowed values:  Integer 0 or more
      Default:         0

  -name
      Description:     Name of the consensus sequence
      Allowed values:  Any string is accepted
      Default:         An empty string is accepted

Advanced qualifiers

  (none)
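
For scripted use, the qualifiers listed above can be assembled into a command line. The Python helper below is a hypothetical sketch (build_cons_command is not part of EMBOSS); it only builds the argument list from the documented qualifier names and does not run the program.

```python
def build_cons_command(alignment, outfile, datafile=None, plurality=None,
                       setcase=None, identity=None, name=None):
    """Assemble an argument list for cons from the documented qualifiers."""
    argv = ["cons", "-msf", alignment, "-outseq", outfile]
    optional = [
        ("-datafile", datafile),
        ("-plurality", plurality),
        ("-setcase", setcase),
        ("-identity", identity),
        ("-name", name),
    ]
    for flag, value in optional:
        if value is not None:  # omit qualifiers left at their defaults
            argv += [flag, str(value)]
    return argv


argv = build_cons_command("aligned.fasta", "aligned.cons",
                          identity=3, name="myconsensus")
print(" ".join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) on a system where the cons executable is on the PATH.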

Input file format

The input is the USA (Uniform Sequence Address) of a set of aligned sequences.

Output file format

The output consists of a sequence file holding the consensus sequence. For example:


>EMBOSS_001
tagctgacctgacgggactgatgcgt
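
Assuming the consensus is written in FASTA format, as in the example above, the output can be read back with a few lines of Python. The read_fasta function below is a hypothetical helper written for this illustration, not an EMBOSS tool.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # The identifier is the first word after '>'.
            name = (line[1:].split() or [""])[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = ">EMBOSS_001\ntagctgacctgacgggactgatgcgt\n"
for identifier, sequence in read_fasta(example):
    # Count positions where no consensus was found in a DNA consensus.
    print(identifier, len(sequence), sequence.count("n"))
```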

Data files

It uses the standard set of scoring matrix data files.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:


% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory or, again, in a subdirectory called ".embossdata".

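The directories are searched in the following order:

  . (your current directory)
  .embossdata (under your current directory)
  ~/ (your home directory)
  ~/.embossdata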

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name   Description
megamerger     Merge two large overlapping nucleic acid sequences
merger         Merge two overlapping nucleic acid sequences

Author(s)

This application was written by Tim Carver (tcarver@hgmp.mrc.ac.uk)

History

Written (Oct 2000) - Tim Carver

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments