EMBOSS: chips


Program chips

Function

Codon usage statistics

Description

Calculates Frank Wright's Nc statistic for the effective number of codons used (ref 1).

The Nc statistic is unreliable for very short sequences (20 amino acids or fewer), and these problems have yet to be fully resolved. They arise because the calculation has to account for amino acids that are missing from the sequence.
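To make the statistic concrete, here is a minimal Python sketch of the Nc formula from ref 1. It is not the chips source code. The function name, the use of the standard genetic code, the cap at 61, and the choice to raise an error when a degeneracy class has no usable data (the short-sequence problem described above) are all assumptions of this sketch.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop).
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Number of synonymous codons available to each amino acid.
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")


def effective_number_of_codons(cds):
    """Return Wright's Nc for an in-frame coding sequence (illustrative only)."""
    cds = cds.upper().replace("U", "T")
    counts = defaultdict(Counter)  # amino acid -> codon -> count
    for pos in range(0, len(cds) - 2, 3):
        codon = cds[pos:pos + 3]
        aa = CODON_TABLE.get(codon)  # None for codons with ambiguity codes
        if aa is not None and aa != "*":
            counts[aa][codon] += 1

    # Codon homozygosity F for every degenerate amino acid seen at least twice.
    homozygosity = defaultdict(list)  # degeneracy class -> list of F values
    for aa, codons in counts.items():
        n = sum(codons.values())
        k = DEGENERACY[aa]
        if k == 1 or n < 2:
            continue
        sum_p2 = sum((c / n) ** 2 for c in codons.values())
        homozygosity[k].append((n * sum_p2 - 1) / (n - 1))

    nc = 2.0  # Met and Trp contribute one codon each
    # (degeneracy class, number of amino acids in that class)
    for k, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = homozygosity.get(k)
        mean_f = sum(values) / len(values) if values else 0.0
        if mean_f <= 0:
            raise ValueError(f"too few codons to estimate the {k}-fold class")
        nc += n_amino_acids / mean_f
    return min(nc, 61.0)  # assumption: values above 61 are capped at 61
```

A sequence that uses a single codon for every amino acid gives 20 with this sketch, and a long sequence that uses all 61 sense codons equally gives 61.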

This calculation was originally in the EGCG package as "codfish" (codon usage for fission yeast). As Frank Wright is a vegan, we looked for a meat-free name for the EMBOSS version and settled on "chips". The official explanation is "Codon Heterozygosity (Inverse of) in a Protein-coding Sequence".

Usage

Here is a sample session with chips. If the sequence extends beyond the coding region then the start and/or end positions of the CDS must be provided because chips analyses exclusively protein coding regions.

% chips -sbeg 135 -send 1292
Input sequence: embl:paamir
Output file [paamir.chips]: 
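For scripted use, the same session might be driven from Python. The sketch below only assembles an argument list from the qualifiers that appear in this document (-sbeg, -send, -seqall, -outfile). The helper names are hypothetical, and it is an assumption that these qualifiers can all be supplied on one command line so that chips does not prompt.

```python
import subprocess


def build_chips_command(usa, outfile, sbeg=None, send=None):
    """Assemble a chips argument list from the qualifiers shown in this document."""
    command = ["chips"]
    # CDS start/end are only needed when the sequence extends beyond the coding region.
    if sbeg is not None:
        command += ["-sbeg", str(sbeg)]
    if send is not None:
        command += ["-send", str(send)]
    command += ["-seqall", usa, "-outfile", outfile]
    return command


def run_chips(usa, outfile, sbeg=None, send=None):
    """Run chips; requires an EMBOSS installation on the PATH."""
    subprocess.run(build_chips_command(usa, outfile, sbeg, send), check=True)
```

Calling build_chips_command("embl:paamir", "paamir.chips", 135, 1292) reproduces the inputs of the sample session as a single command.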

Command line arguments

   Mandatory qualifiers:
  [-seqall]            seqall     Sequence database USA
  [-outfile]           outfile    Output file name

   Optional qualifiers: (none)
   Advanced qualifiers:
   -cfile              codon      Codon usage file
   -window             integer    Averaging window


Qualifier        Description             Allowed values                         Default

Mandatory qualifiers
[-seqall]        Sequence database USA   Readable sequence(s)                   Required
(Parameter 1)
[-outfile]       Output file name        Output file                            <sequence>.chips
(Parameter 2)

Optional qualifiers
(none)

Advanced qualifiers
-cfile           Codon usage file        Codon usage file in EMBOSS data path   Ehum.cut
-window          Averaging window        Any integer value                      30

Input file format

Output file format

This is the output from the example run:

# CHIPS codon usage statistics

Nc = 32.951

If all codons are used equally, the Nc value will be 61. If only one codon is used for each amino acid, the Nc value will be 20. Low values therefore indicate a strong codon bias, and high values indicate a low bias and possibly a non-coding region.
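A script that consumes this report only needs the "Nc = ..." line. The following Python sketch extracts it. The function name is hypothetical, and it assumes the report looks like the example above. If a report contains several such lines, only the first is returned.

```python
import re

# Matches a line such as "Nc = 32.951".
NC_LINE = re.compile(r"^\s*Nc\s*=\s*([-+]?\d+(?:\.\d+)?)\s*$")


def parse_nc(report_text):
    """Return the first Nc value in chips report text, or None if none is found."""
    for line in report_text.splitlines():
        match = NC_LINE.match(line)
        if match:
            return float(match.group(1))
    return None
```

Values near 20 can then be flagged as strongly biased and values near 61 as weakly biased, following the interpretation given above.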

Data files

chips reads a codon usage file, but only as a template: the usage values stored in the file are ignored.

The codon usage table is by default the file "CODONS/Ehum.cut" in the EMBOSS distribution directory.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".

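The directories are searched in the following order:

  * . (your current directory)
  * .embossdata (under your current directory)
  * ~/ (your home directory)
  * ~/.embossdata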
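The lookup could be sketched in Python as follows. This is not EMBOSS code. The function name and its parameters are hypothetical, and the sketch assumes that the user locations are tried in the order they are mentioned above (current directory, its ".embossdata" subdirectory, home directory, its ".embossdata" subdirectory), with the EMBOSS_DATA directory as the final fallback.

```python
import os
from pathlib import Path


def find_data_file(name, cwd=None, home=None, emboss_data=None):
    """Return the first existing path for a data file, or None if it is not found."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    if emboss_data is None:
        emboss_data = os.environ.get("EMBOSS_DATA")

    # Assumed search order: user directories first, standard data directory last.
    candidates = [cwd, cwd / ".embossdata", home, home / ".embossdata"]
    if emboss_data:
        candidates.append(Path(emboss_data))

    for directory in candidates:
        path = directory / name
        if path.is_file():
            return path
    return None
```

With this ordering, a project-specific copy of a file in the current directory takes precedence over a copy in the home directory or in the distributed data directory.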

Notes

References

  1. Wright, F. (1990) Gene 87:23-29 "The 'effective number of codons' used in a gene."

Warnings

Diagnostic Error Messages

Exit status

Known bugs

See also

Program name   Description
chaos          Create a chaos game representation plot for a sequence
codcmp         Codon usage table comparison
compseq        Counts the composition of dimer/trimer/etc words in a sequence
cusp           Create a codon usage table
freak          Residue/base frequency table or plot
geecee         Calculates the fractional GC content of nucleic acid sequences
getorf         Finds and extracts open reading frames (ORFs)
isochore       Plots isochores in large DNA sequences
newcpgreport   Report CpG rich areas
newcpgseek     Reports CpG rich regions
syco           Synonymous codon usage Gribskov statistic plot
wobble         Wobble base plot
wordcount      Counts words of a specified size in a DNA sequence

Author(s)

This application was written by Alan Bleasby (ableasby@hgmp.mrc.ac.uk)

History

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments