Netmond V2. Regular expression (regex)
The Netmond built-in regex is a regular expression pattern
matching and replacement language. It is written in C.
These routines are the equivalents of regex routines as found in
4.nBSD UN*X, with minor extensions. These routines are derived from
various implementations found in software tools books, and Conroy's grep.
They are NOT derived from licensed/restricted software.
For more interesting/academic/complicated implementations, see Henry
Spencer's regexp routines, or GNU Emacs pattern matching module.
Regular Expressions:
- [1] char
- Matches itself, unless it is a special
character (metachar): . \ [ ] * + ^ $ ( )
- [2] .
- Matches any character.
- [3] \
- Matches the character following it, except
when followed by a left or right round bracket,
a digit 1 to 9 or a left or right angle bracket.
(see [7], [8] and [9])
It is used as an escape character for all
other meta-characters, and itself. When used
in a set ([4]), it is treated as an ordinary
character.
- [4] [set]
- Matches one of the characters in the set.
If the first character in the set is "^",
it matches a character NOT in the set. A
shorthand S-E is used to specify a set of
characters S up to E, inclusive. The special
characters "]" and "-" have no special
meaning if they appear as the first chars
in the set. Examples:
- [a-z]
- Match any lowercase alpha.
- [^]-]
- Match any char except "]" and "-".
- [^A-Z]
- Match any char except uppercase alpha.
- [a-zA-Z]
- Match any alpha.
- [5] *
- Any regular expression form [1] to [4], followed by
closure char "*" matches zero or more matches of
that form.
- [6] +
- Same as [5], except it matches one or more.
- [7] (tag)
- A regular expression in the form [1] to [10], enclosed
as (form) matches what form matches. The enclosure
creates a set of tags, used for [8] and for
pattern substitution. The tagged forms are numbered
starting from 1.
- [8] \digit
- A \ followed by a digit 1 to 9 matches whatever a
previously tagged regular expression ([7]) matched.
- [9] \<word\>
- A regular expression starting with a \< construct
and/or ending with a \> construct, restricts the
pattern matching to the beginning of a word, and/or
the end of a word. A word is defined to be a character
string beginning and/or ending with the characters
A-Z a-z 0-9 and _. It must also be preceded and/or
followed by any character outside those mentioned.
- [10] xy
- A composite regular expression xy where x and y
are in the form [1] to [10] matches the longest
match of x followed by a match for y.
- [11] ^exact$
- A regular expression starting with a "^" character
and/or ending with a "$" character, restricts the
pattern matching to the beginning of the line,
or the end of line. [anchors] Elsewhere in the
pattern, "^" and "$" are treated as ordinary characters.
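The interaction of forms [1], [2], [5], [6], [10] and [11] can be sketched with a small backtracking matcher. This is not the Netmond code: the function names `match`, `match_here` and `match_closure` are hypothetical, and sets, tags, back-references, word anchors and `\` escapes are deliberately left out. It only illustrates how a closure takes the longest run first and then backs off until the rest of the pattern fits.

```c
#include <stddef.h>

static int match_here(const char *re, const char *text);

/* Closure of a single-character form c ('.' means any character).
 * min is 0 for "*" and 1 for "+". The longest run is tried first,
 * then shortened one character at a time (form [10]). */
static int match_closure(int c, int min, const char *re, const char *text)
{
    size_t n = 0;

    while (text[n] != '\0' && (c == '.' || text[n] == c))
        n++;                              /* length of the longest run */
    for (;;) {
        if (n < (size_t)min)
            return 0;                     /* "+" needs at least one */
        if (match_here(re, text + n))
            return 1;                     /* rest of the pattern fits */
        if (n == 0)
            return 0;
        n--;                              /* back off and retry */
    }
}

/* Match re at the start of text. */
static int match_here(const char *re, const char *text)
{
    if (re[0] == '\0')
        return 1;
    if (re[0] == '$' && re[1] == '\0')    /* "$" anchors only at the end */
        return *text == '\0';
    if (re[1] == '*')
        return match_closure(re[0], 0, re + 2, text);
    if (re[1] == '+')
        return match_closure(re[0], 1, re + 2, text);
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return match_here(re + 1, text + 1);
    return 0;
}

/* Return 1 if re matches anywhere in text, 0 otherwise. */
int match(const char *re, const char *text)
{
    if (re[0] == '^')                     /* "^" anchors only at the start */
        return match_here(re + 1, text);
    do {
        if (match_here(re, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}
```

Under these assumptions, `match("foo*.*", "fobar")` succeeds, while `match("^fo+$", "f")` fails because "+" requires at least one "o".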
- Authors:
- Written in C by Ozan S. Yigit (oz), Dept. of Computer Science, York University.
ANSI prototypes and regex.h added by Mark Russell, UKC.
- Acknowledgements:
- HCR's Hugh Redelmeier has been most helpful in various
stages of development. He convinced me to include BOW
and EOW constructs, originally invented by Rob Pike at
the University of Toronto.
- References:
- Software tools in Pascal, Kernighan & Plauger.
Grep [rsx-11 C dist], David Conroy.
ed - text editor, Un*x Programmer's Manual.
Advanced editing on Un*x, B. W. Kernighan.
RegExp routines, Henry Spencer.
- Notes:
- This implementation uses a bit-set representation for character
classes for speed and compactness. Each character is represented
by one bit in a 128-bit block. Thus, CCL or NCL always takes a
constant 16 bytes in the internal dfa, and yre_exec does a single
bit comparison to locate the character in the set.
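The bit-set layout described in this note can be sketched as follows. The type `charset` and the helpers `cs_clear`, `cs_add_range` and `cs_has` are hypothetical names chosen for illustration, not identifiers from the Netmond source. The sketch assumes 7-bit ASCII and 8-bit bytes, which gives the 16-byte block mentioned above.

```c
#include <limits.h>
#include <string.h>

#define MAXCHR 128                   /* 7-bit ASCII: one bit per code */
#define BITBLK (MAXCHR / CHAR_BIT)   /* 16 bytes when CHAR_BIT is 8 */

/* Hypothetical character-class type: a fixed 128-bit block. */
typedef struct {
    unsigned char bits[BITBLK];
} charset;

/* Start with an empty set. */
void cs_clear(charset *cs)
{
    memset(cs->bits, 0, sizeof cs->bits);
}

/* Add every character from lo to hi, inclusive (the S-E shorthand). */
void cs_add_range(charset *cs, int lo, int hi)
{
    for (int c = lo; c <= hi && c < MAXCHR; c++) {
        if (c >= 0)
            cs->bits[c / CHAR_BIT] |= (unsigned char)(1u << (c % CHAR_BIT));
    }
}

/* Membership is one array index plus one bit test.
 * A negated class would simply invert this result. */
int cs_has(const charset *cs, int c)
{
    if (c < 0 || c >= MAXCHR)
        return 0;
    return (cs->bits[c / CHAR_BIT] >> (c % CHAR_BIT)) & 1;
}
```

For example, after `cs_clear(&cs)` and `cs_add_range(&cs, 'a', 'z')`, the call `cs_has(&cs, 'm')` returns 1 and `cs_has(&cs, 'M')` returns 0. The size of the set stays constant no matter how many characters the class names.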
- Examples:
-
Pattern | Matches |
foo*.* | fo foo fooo foobar fobar foxx ... |
fo[ob]a[rz] | fobar fooar fobaz fooaz |
foo\\+ | foo\ foo\\ foo\\\ ... |
(foo)[1-3]\1 (same as foo[1-3]foo) | foo1foo foo2foo foo3foo |
(fo.*)-\1 | foo-foo fo-fo fob-fob foobar-foobar ... |
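The two tagged patterns in the table can be tried without the Netmond library by using the POSIX `regcomp` and `regexec` routines. That is a different implementation with a different syntax: in POSIX basic regular expressions a tag is written `\(` ... `\)` rather than `(` ... `)`. The helper name `bre_matches` is hypothetical. The sketch shows only the behaviour of a back-reference and says nothing about how Netmond compiles or runs its patterns.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches the POSIX basic regular expression,
 * 0 if it does not, and -1 if the pattern fails to compile. */
int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);   /* 0 means a match was found */
    regfree(&re);
    return rc == 0;
}
```

With this helper, `bre_matches("\\(foo\\)[1-3]\\1", "foo2foo")` returns 1 and the same pattern against "foo4foo" returns 0. The anchored pattern `"^\\(fo.*\\)-\\1$"` accepts "foobar-foobar" but rejects "foo-fob", since `\1` must repeat exactly what the tag matched.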
© 1998-2002, Rinet Software