EMBOSS: dreg
Program dreg
Function
regular expression search of a nucleotide sequence
Description
This searches for matches of a regular expression to a nucleic acid sequence.
A regular expression is a way of specifying an ambiguous pattern to
search for. Regular expressions are common in computer programming
languages, so some users will already be familiar with them.
The following is a short guide to regular expressions in EMBOSS:
- ^
  Use this at the start of a pattern to insist that the pattern can only
  match at the start of a sequence (e.g. '^AUG' matches a start codon at
  the start of the sequence).
- $
  Use this at the end of a pattern to insist that the pattern can only
  match at the end of a sequence (e.g. 'A+$' matches a poly-A sequence at
  the end of the sequence).
- ()
  Groups a pattern. This is commonly used with '|' (e.g. '(AUG)|(ATG)'
  matches either the RNA or DNA form of the initiation codon).
- |
  This is the OR operator, which allows a match to be made to either one
  pattern OR another. There is no AND operator in this version of regular
  expressions.
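The anchors, grouping and OR operator described above can be tried out in a short sketch. It uses Python's `re` module as a stand-in for the EMBOSS regular expression engine, which is an assumption: the two agree on these basic features, but this is not how dreg itself is implemented. The helper name `matches` is hypothetical.

```python
import re

# Assumption: Python's `re` stands in for the EMBOSS regex engine.
def matches(pattern, sequence):
    """Return True if `pattern` matches anywhere in `sequence`."""
    return re.search(pattern, sequence) is not None

print(matches("^AUG", "AUGGCC"))          # True: AUG is at the start
print(matches("^AUG", "GCCAUG"))          # False: AUG is present, but not at the start
print(matches("A+$", "GCCAAAA"))          # True: poly-A run at the end
print(matches("(AUG)|(ATG)", "CCATGCC"))  # True: the DNA form is found
```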
The following quantifier characters specify the number of times that
the character before them (in this case 'x') matches:
- x?
  matches 0 or 1 times (i.e. '' or 'x')
- x*
  matches 0 or more times (i.e. '' or 'x' or 'xx' or 'xxx', etc.)
- x+
  matches 1 or more times (i.e. 'x' or 'xx' or 'xxx', etc.)
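A minimal sketch of the three quantifiers, again assuming Python's `re` module behaves like the EMBOSS engine for these cases. Each pattern is tested against whole strings, so the output shows exactly how many 'A' characters each quantifier accepts between 'G' and 'T'. The helper name `full` is hypothetical.

```python
import re

# Assumption: Python's `re` stands in for the EMBOSS regex engine.
def full(pattern, text):
    """Return True if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

candidates = ("GT", "GAT", "GAAT")
for pattern in ("GA?T", "GA*T", "GA+T"):
    hits = [s for s in candidates if full(pattern, s)]
    print(pattern, hits)
# GA?T ['GT', 'GAT']
# GA*T ['GT', 'GAT', 'GAAT']
# GA+T ['GAT', 'GAAT']
```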
Quantifiers can follow any of the following types of character specification:
- x
  any literal character (e.g. 'A')
- \x
  the character after the backslash is used instead of its normal
  regular expression meaning. This is commonly used to turn off the
  special meaning of the characters '^$()|?*+[]-.'. It may be especially
  useful when searching for gap characters in a sequence (e.g. '\.' matches
  only a dot character '.')
- [xy]
  matches one of the characters 'x' or 'y'. You may have one or more
  characters in this set.
- [x-z]
  matches any one of the set of characters starting with 'x' and
  ending in 'z' in ASCII order (e.g. '[A-G]' matches any one of: 'A', 'B',
  'C', 'D', 'E', 'F', 'G')
- [^x-z]
  matches anything except any one of the group of characters in
  ASCII order (e.g. '[^A-G]' matches anything EXCEPT any one of: 'A', 'B',
  'C', 'D', 'E', 'F', 'G')
- .
  the dot character matches any single character (e.g. 'A.G' matches
  'AAG', 'AaG', 'AZG', 'A-G', 'A G', etc.)
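The character specifications listed above can be compared side by side in a small sketch. Python's `re` module is used as an assumed stand-in for the EMBOSS engine, and `first_match` is a hypothetical helper that returns the leftmost matching text.

```python
import re

# Assumption: Python's `re` stands in for the EMBOSS regex engine.
def first_match(pattern, sequence):
    """Return the text of the leftmost match, or None if there is none."""
    m = re.search(pattern, sequence)
    return m.group(0) if m else None

print(first_match("[AG]C", "TTGCA"))     # GC   (set: 'A' or 'G', then 'C')
print(first_match("[A-G]T", "XXHTCT"))   # CT   (range: 'H' is outside A-G)
print(first_match("[^A-G]T", "GTHT"))    # HT   (negated range: 'G' is excluded)
print(first_match(r"A\.G", "AAGA.G"))    # A.G  (escaped dot matches only '.')
print(first_match("A.G", "AAGA.G"))      # AAG  (bare dot matches any one character)
```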
Combining some of these features gives the example:
'([AGC]+GGG)|(TTTGGG)'
which matches either a run of one or more 'A', 'G' or 'C' characters
followed by three 'G's, or exactly 'TTTGGG'.
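To see which half of the combined pattern is responsible for a match, the following sketch reports the matched text together with the branch number. It assumes Python's `re` module agrees with the EMBOSS engine on this pattern; the helper name `describe` is hypothetical.

```python
import re

# Assumption: Python's `re` stands in for the EMBOSS regex engine.
PATTERN = re.compile(r"([AGC]+GGG)|(TTTGGG)")

def describe(sequence):
    """Return (matched text, branch number), or None if nothing matches."""
    m = PATTERN.search(sequence)
    if m is None:
        return None
    branch = 1 if m.group(1) is not None else 2
    return m.group(0), branch

print(describe("ACGGG"))    # ('ACGGG', 1)
print(describe("TTTGGG"))   # ('TTTGGG', 2)
print(describe("GGGG"))     # ('GGGG', 1): one 'G' for [AGC]+, then 'GGG'
print(describe("TTGGG"))    # None: too few 'T's, and nothing before 'GGG'
```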
Regular expressions are case-sensitive.
The pattern 'AAAA' will not match the sequence 'aaaa'.
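Case sensitivity can be illustrated with a short sketch, assuming Python's `re` module as a stand-in for the EMBOSS engine. Two possible workarounds are shown, both built only from features described in this guide: converting the sequence to upper case before searching, or writing each position as a character set. The helper name `found` is hypothetical.

```python
import re

# Assumption: Python's `re` stands in for the EMBOSS regex engine.
def found(pattern, sequence):
    """Return True if `pattern` matches anywhere in `sequence`."""
    return re.search(pattern, sequence) is not None

print(found("AAAA", "aaaa"))                  # False: case differs
print(found("AAAA", "aaaa".upper()))          # True: sequence converted first
print(found("[Aa][Aa][Aa][Aa]", "aaAa"))      # True: each position accepts both cases
```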
Usage
Here is a sample session with dreg.
% dreg
Input sequence: embl:paamir
Output file [paamir.dreg]:
Regular expression pattern: ggtacc
Command line arguments
Mandatory qualifiers:
[-sequence] seqall Sequence database USA
[-outfile] outfile Output file name
[-pattern] regexp Regular expression pattern
Optional qualifiers: (none)
Advanced qualifiers: (none)
Mandatory qualifiers | Description | Allowed values | Default |
[-sequence] (Parameter 1) | Sequence database USA | Readable sequence(s) | Required |
[-outfile] (Parameter 2) | Output file name | Output file | <sequence>.dreg |
[-pattern] (Parameter 3) | Regular expression pattern | Any regular expression pattern is accepted | Required |
Optional qualifiers | Description | Allowed values | Default |
(none) |
Advanced qualifiers | Description | Allowed values | Default |
(none) |
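For use from a script, the three mandatory qualifiers can be passed on the command line rather than answered at the prompts. The sketch below only builds and prints such a command line; `dreg_command` is a hypothetical helper, and actually running the command assumes that dreg is installed and on the PATH.

```python
import shlex

def dreg_command(sequence, outfile, pattern):
    """Build an argument list for dreg from its three mandatory qualifiers."""
    return ["dreg",
            "-sequence", sequence,
            "-outfile", outfile,
            "-pattern", pattern]

cmd = dreg_command("embl:paamir", "paamir.dreg", "ggtacc")
print(shlex.join(cmd))
# dreg -sequence embl:paamir -outfile paamir.dreg -pattern ggtacc

# To run it (assumes dreg is installed and on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with patterns that contain characters such as '|', '(' or '$'.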
Input file format
Any readable nucleotide sequence.
Output file format
This is the output from the example run. Sequence embl:paamir begins
at a restriction site with the sequence pattern GGTACC.
dreg search of embl:paamir with pattern GGTACC
Matches in PAAMIR
PAAMIR 1 GGTACC
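The layout of the report above (sequence name, start position, matched text) can be imitated with a short sketch. This is not dreg's implementation: it uses Python's `re.finditer`, which reports non-overlapping matches, and it assumes positions are counted from 1, as the sample output suggests. The function name `report` and the toy sequence are hypothetical.

```python
import re

def report(name, sequence, pattern):
    """Return report lines: a header, then one line per match."""
    lines = [f"Matches in {name}"]
    for m in re.finditer(pattern, sequence):
        # m.start() counts from 0; the report counts from 1 (assumption).
        lines.append(f"{name} {m.start() + 1} {m.group(0)}")
    return lines

toy = "GGTACCAAGGTACC"   # made-up sequence with two GGTACC sites
print("\n".join(report("TOY", toy, "GGTACC")))
# Matches in TOY
# TOY 1 GGTACC
# TOY 9 GGTACC
```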
Data files
Notes
References
Warnings
Regular expressions are case-sensitive.
The pattern 'AAAA' will not match the sequence 'aaaa'.
Diagnostic Error Messages
Exit status
Always returns 0.
Known bugs
See also
Program name | Description |
fuzznuc | Nucleic acid pattern search |
fuzzpro | Protein pattern search |
fuzztran | Protein pattern search after translation |
patmatdb | Search a protein sequence with a motif |
patmatmotifs | Search a PROSITE motif database with a protein sequence |
preg | regular expression search of a protein sequence |
pscan | Scans proteins using PRINTS |
tfscan | Scans DNA sequences for transcription factors |
Other EMBOSS programs allow you to search for simple patterns and may be
easier for users who have never used regular expressions before:
- fuzznuc - Nucleic acid pattern search
- fuzzpro - Protein pattern search
- fuzztran - Protein pattern search after translation
Author(s)
This application was written by Peter Rice (pmr@sanger.ac.uk) Informatics
Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton,
Cambridge, CB10 1SA, UK.
History
Target users
This program is intended to be used by everyone and everything,
from naive users to embedded scripts.
Comments