In the previous section, we were dealing with a pattern where the same character could be repeated up to 3 times. There is another way to express
this in regular expressions, which some people find more readable.
Example 7.5. The old way: every character optional
>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'M')
<_sre.SRE_Match object at 0x008EE090>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EEB48>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMMM')
>>>
For 'M', this matches the start of the string, then the first optional M, but not the second and third M (that's OK because they're optional), then the end of the string.
For 'MM', this matches the start of the string, then the first and second optional M, but not the third M (that's OK because it's optional), then the end of the string.
For 'MMM', this matches the start of the string, then all three optional M characters, then the end of the string.
For 'MMMM', this matches the start of the string, then all three optional M characters, but then fails to match the end of the string because one M is still unmatched. The pattern does not match, so re.search returns None.
Example 7.6. The new way: from n to m
>>> pattern = '^M{0,3}$'
>>> re.search(pattern, 'M')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EEDA8>
>>> re.search(pattern, 'MMMM')
>>>
This pattern says: “Match the start of the string, then anywhere from 0 to 3 M characters, then the end of the string.” The 0 and 3 can be any numbers; if you want to match at least 1 but no more than 3 M characters, you could say M{1,3}.
For 'M', this matches the start of the string, then 1 M out of a possible 3, then the end of the string.
For 'MM', this matches the start of the string, then 2 M out of a possible 3, then the end of the string.
For 'MMM', this matches the start of the string, then 3 M out of a possible 3, then the end of the string.
For 'MMMM', this matches the start of the string, then 3 M out of a possible 3, but then fails to match the end of the string. The regular expression allows at most 3 M characters before the end of the string, but we have 4, so the pattern does not match and re.search returns None.
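To see how the bounds in {n,m} change the result, here is a minimal sketch that compares M{0,3} with the M{1,3} variation mentioned above. The variable names and the list of test strings are chosen only for illustration.

```python
import re

# Two illustrative variations on the same idea; only the lower bound differs.
zero_to_three = re.compile(r'^M{0,3}$')
one_to_three = re.compile(r'^M{1,3}$')

for text in ['', 'M', 'MMM', 'MMMM']:
    # Print the input, then whether each pattern matched it.
    print(repr(text),
          bool(zero_to_three.search(text)),
          bool(one_to_three.search(text)))
```

The only input where the two patterns differ is the empty string: M{0,3} accepts zero M characters, while M{1,3} requires at least one.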
Python's re module gives you no way to check programmatically that two regular expressions are equivalent. In practice, the best you can do is write lots of test cases to make sure they behave the same way on all relevant inputs. We'll talk more about writing test cases later in this book.
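As a rough sketch of what such test cases might look like, the following compares the old and new patterns on every string up to a given length over a small alphabet. The function name disagreements, the alphabet 'MX', and the length limit are all assumptions made for this example. Agreement on these inputs is evidence, not proof, that the patterns are equivalent.

```python
import itertools
import re

old_way = re.compile(r'^M?M?M?$')
new_way = re.compile(r'^M{0,3}$')

def disagreements(alphabet='MX', max_length=5):
    """Return every tested string on which the two patterns give different answers."""
    found = []
    for length in range(max_length + 1):
        # Generate all strings of this length built from the alphabet.
        for letters in itertools.product(alphabet, repeat=length):
            text = ''.join(letters)
            if bool(old_way.search(text)) != bool(new_way.search(text)):
                found.append(text)
    return found

# An empty list means no difference was found on the tested inputs.
print(disagreements())
```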
Now let's expand our Roman numeral regular expression to cover the tens and ones place.
Example 7.7. The tens place
>>> pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
>>> re.search(pattern, 'MCMXL')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCML')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLX')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXX')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXX')
>>>
For 'MCMXL', this matches the start of the string, then the first optional M, then CM, then XL, then the end of the string. Remember, the (A|B|C) syntax means “match exactly one of A, B, or C”. We match XL, so we ignore the XC and L?X?X?X? choices, and then move on to the end of the string.
For 'MCML', this matches the start of the string, then the first optional M, then CM, then L?X?X?X?. Of the L?X?X?X?, it matches the L and skips all 3 optional X characters. Then the end of the string.
For 'MCMLX', this matches the start of the string, then the first optional M, then CM, then the optional L and the first optional X, skips the second and third optional X, then the end of the string.
For 'MCMLXXX', this matches the start of the string, then the first optional M, then CM, then the optional L and all 3 optional X characters, then the end of the string.
For 'MCMLXXXX', this matches the start of the string, then the first optional M, then CM, then the optional L and all 3 optional X characters, then fails to match the end of the string because one more X is still unaccounted for. So the entire pattern fails to match, and re.search returns None.
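One way to check the walkthrough above is to ask the match object which text each parenthesized group captured. This is a small sketch using the same pattern and inputs; the loop itself is just for illustration.

```python
import re

pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'

for numeral in ['MCMXL', 'MCML', 'MCMLX', 'MCMLXXX', 'MCMLXXXX']:
    match = re.search(pattern, numeral)
    # groups() returns the text captured by each parenthesized group:
    # first the hundreds place, then the tens place.
    print(numeral, match.groups() if match else None)
```

For 'MCMXL' the groups are 'CM' and 'XL'; for 'MCMLXXX' they are 'CM' and 'LXXX'; for 'MCMLXXXX' there is no match at all, so there are no groups to report.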
The expression for the ones place follows the same pattern. I'll spare you the details and show you the end result:
Example 7.8. The ones place
>>> pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
OK, so what does that look like using this alternate {n,m} syntax?
Example 7.9. Validating Roman numerals with {n,m}
>>> pattern = '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMDCLXVI')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMMDCCCLXXXVIII')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'I')
<_sre.SRE_Match object at 0x008EEB48>
For 'MDLV', this matches the start of the string, then 1 of a possible 4 M characters, then D?C{0,3}. Of that, it matches the optional D and 0 of 3 possible C characters. Moving on, it matches L?X{0,3} by matching the optional L and 0 of 3 possible X characters. Then it matches V?I{0,3} by matching the optional V and 0 of 3 possible I characters, and finally the end of the string. (Are your eyes bleeding yet?)
For 'MMDCLXVI', this matches the start of the string, then 2 of a possible 4 M characters, then D?C{0,3} with a D and 1 of 3 possible C characters. Then L?X{0,3} with an L and 1 of 3 possible X characters. Then V?I{0,3} with a V and 1 of 3 possible I characters. Then the end of the string.
For 'MMMMDCCCLXXXVIII', this matches the start of the string, then 4 out of 4 M characters, then D?C{0,3} with a D and 3 out of 3 C characters. Then L?X{0,3} with an L and 3 out of 3 X characters. Then V?I{0,3} with a V and 3 out of 3 I characters. Then the end of the string.
Watch closely. (I feel like a magician. “Watch closely, kids, I'm going to pull a rabbit out of my hat.”) For 'I', this matches the start of the string, then 0 out of 4 M, then matches D?C{0,3} by skipping the optional D and matching 0 out of 3 C, then matches L?X{0,3} by skipping the optional L and matching 0 out of 3 X, then matches V?I{0,3} by skipping the optional V and matching 1 out of 3 I. Then the end of the string. Whoa.
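If the walkthrough is hard to follow by eye, a short sketch like the one below can report what each group matched. The helper name describe and the two extra inputs ('IIII' and the empty string) are additions for this example, not part of the original examples.

```python
import re

ROMAN = re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

def describe(numeral):
    """Return the (hundreds, tens, ones) text that matched, or None if nothing matched."""
    match = ROMAN.search(numeral)
    return match.groups() if match else None

for numeral in ['MDLV', 'MMDCLXVI', 'MMMMDCCCLXXXVIII', 'I', 'IIII', '']:
    print(repr(numeral), describe(numeral))
```

Note that every part of the pattern is optional, so the empty string also matches, with three empty groups. Code that uses this pattern to validate input may want to reject the empty string separately.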
If you followed all that and understood it on the first try, you're doing better than I did. Now imagine trying to understand
someone else's regular expressions, in the middle of a critical function of a large program. Or even imagine coming back
to your own regular expressions a few months later. I've done it, and it's not a pretty sight.
In the next section we'll explore an alternate syntax that can help keep your expressions maintainable.