7.6. Case study: parsing phone numbers
So far we've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions
are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where.
This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American
phone number. Of course the client wanted the number to be entered free-form (in a single field), but then wanted to store
the area code, trunk, number, and optionally an extension separately in their database. I scoured the web and found many
examples of regular expressions that purported to do this, and none of them were permissive enough.
Here are some of the phone numbers I'd like to be able to accept:
- 800-555-1212
- 800 555 1212
- 800.555.1212
- (800) 555-1212
- 1-800-555-1212
- 800-555-1212-1234
- 800-555-1212x1234
- 800-555-1212 ext. 1234
- work 1-(800) 555.1212 #1234
Quite a variety! In each of these cases, I need to know that the area code was 800, the trunk was 555, and the rest of the phone number was 1212. For those with an extension, I need to know that the extension was 1234.
Example 7.11. Finding numbers
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212')
>>> phonePattern.search('800-555-1212-1234')
>>>
Always read regular expressions from left to right. This one matches the beginning of the string, then (\d{3}). We already know that \d{3} means “match exactly 3 numeric digits”. Putting it in parentheses means “match exactly 3 numeric digits, and then remember them as a group that I can ask for later”. Then match a literal hyphen. Then match another group of exactly 3 digits. Then another literal hyphen. Then another
group of exactly 4 digits. Then end of string.
To get access to the groups that the regular expression parser remembered along the way, use the groups() method on the match object that the search method returns. It will return a tuple of however many groups were defined in the regular expression. In this case, we
defined three groups, one with 3 digits, one with 3 digits, and one with 4 digits.
This regular expression is not our final answer, though, because it doesn't handle a phone number with an extension on the
end. For that, we'll need to expand our regular expression.
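In the session above, the second search prints nothing because search returned None, and calling groups() on None would raise an exception. Here is a minimal sketch of how a caller might guard against that. The names strict_pattern and parse_strict are made up for this illustration; the pattern itself is the one from Example 7.11.

```python
import re

# Same pattern as Example 7.11: exactly NNN-NNN-NNNN and nothing else.
strict_pattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

def parse_strict(text):
    """Return (area_code, trunk, rest) as strings, or None if text does not match."""
    match = strict_pattern.search(text)
    if match is None:
        # No match: calling .groups() on None would raise AttributeError.
        return None
    return match.groups()

print(parse_strict('800-555-1212'))       # ('800', '555', '1212')
print(parse_strict('800-555-1212-1234'))  # None
```

Checking for None before asking for the groups is just one convention; letting the exception propagate is equally valid if a non-matching number is a genuine error in your program.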
Example 7.12. Finding the extension
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')
>>> phonePattern.search('800-555-1212-1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800 555 1212 1234')
>>>
>>> phonePattern.search('800-555-1212')
>>>
OK, this regular expression is almost identical to the previous one. Just as before, we match the beginning of the string,
then a remembered group of 3 digits, then a hyphen, then a remembered group of 3 digits, then a hyphen, then a remembered
group of 4 digits. What's new is that we then match another hyphen, and a remembered group of 1 or more digits, then end
of string.
The groups() method now returns a tuple of 4 elements, since our regular expression now defines 4 groups to remember.
Unfortunately, this regular expression is not our final answer either, because it assumes that the different parts of the
phone number are separated by hyphens. What if they're separated by spaces, or commas, or dots? We need a more general solution
to match several different types of separators.
Oops! Not only does this regular expression not do everything we want, it's actually a step backwards, because now we can't
parse phone numbers without an extension. That's not what we wanted at all; if the extension is there, we want to know what it is, but if it's not there,
we still want to know what the different parts of the main number are.
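To see the trade-off side by side, it can help to run both patterns so far against the same inputs. This is only a rough comparison harness; the names without_ext, with_ext, samples, and results are invented for the illustration.

```python
import re

# Pattern from Example 7.11 (no extension) and Example 7.12 (extension required).
without_ext = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')
with_ext = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')

samples = ['800-555-1212', '800-555-1212-1234', '800 555 1212 1234']

# Map each sample to (matched by 7.11 pattern, matched by 7.12 pattern).
results = {}
for sample in samples:
    results[sample] = (without_ext.search(sample) is not None,
                       with_ext.search(sample) is not None)
    print('%-20s %r' % (sample, results[sample]))
```

Neither pattern accepts both the plain number and the number with an extension, and neither accepts spaces as separators.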
Example 7.13. Handling different separators
>>> phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
>>> phonePattern.search('800 555 1212 1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212-1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('80055512121234')
>>>
>>> phonePattern.search('800-555-1212')
>>>
OK, hang on to your hat. We're matching the beginning of the string, then a group of 3 digits, then \D+. What the heck is that? Well, \D matches any character except a numeric digit, and + means “1 or more”. So \D+ matches 1 or more characters that are not digits. This is what we're using now instead of a literal hyphen, to try to match
different separators.
Using \D+ instead of - means we can now match phone numbers where the parts are separated by spaces instead of hyphens.
Of course, phone numbers separated by hyphens still work too.
Unfortunately, this is still not our final answer, because it assumes that there is a separator at all. What if the phone
number is entered without any spaces or hyphens at all?
Oops! We still haven't fixed the problem of requiring extensions. Now we have two problems, but we can solve both of them
with the same technique.
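If you want to see exactly what \D+ swallows, one way is to pull out the separators on their own with re.findall. This is a small sketch; the input string and the names separators, sep_pattern, and mixed are chosen for the illustration, and the pattern is the one from Example 7.13.

```python
import re

# re.findall returns every non-overlapping run of one or more non-digits.
separators = re.findall(r'\D+', '800.555-1212 x1234')
print(separators)        # ['.', '-', ' x']

# The Example 7.13 pattern accepts any mix of such separators.
sep_pattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
mixed = sep_pattern.search('800.555-1212 x1234')
print(mixed.groups())    # ('800', '555', '1212', '1234')
```

Note that a separator can be more than one character long (the space plus the x), which a single literal hyphen could never match.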
Example 7.14. Handling no separators
>>> phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
>>> phonePattern.search('80055512121234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800.555.1212 x1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
>>> phonePattern.search('(800)5551212 x1234')
>>>
The only change we've made since the last step is changing all the + to *. Instead of \D+ between the parts of the phone number, we now match on \D*. Remember that + means “1 or more”? Well, * means “0 or more”. So now we should be able to parse phone numbers even when there is no separator character at all.
Lo and behold, it actually works. Why? We matched the beginning of the string, then a remembered group of 3 digits (800), then 0 non-numeric characters, then a remembered group of 3 digits (555), then 0 non-numeric characters, then a remembered group of 4 digits (1212), then 0 non-numeric characters, then a remembered group of an arbitrary number of digits (1234), then end of string.
Other variations work now too. Dots instead of hyphens, and both a space and an x before the extension.
Finally, we've solved our other long-standing problem: extensions are optional again. If no extension is found, the groups() method still returns a tuple of 4 elements, but the fourth element is just an empty string.
I hate to be the bearer of bad news, but we're not done yet. What's the problem here? There's an extra character before
the area code, but our regular expression assumes that the area code is the first thing at the beginning of the string. No
problem, we can use the same technique of “0 or more non-numeric characters” to skip over the leading characters before the area code.
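Since a missing extension now comes back as an empty string, code that stores the pieces in a database may want to turn that into None. The sketch below shows one possible convention, not something the examples here require; the names star_pattern and split_number are hypothetical, and the pattern is the one from Example 7.14.

```python
import re

# Pattern from Example 7.14: separators and extension are all optional.
star_pattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

def split_number(text):
    """Return (area, trunk, rest, extension), with extension None when absent."""
    match = star_pattern.search(text)
    if match is None:
        return None
    area, trunk, rest, ext = match.groups()
    # (\d*) matched zero digits, so ext is '' when there is no extension.
    return area, trunk, rest, ext or None

print(split_number('800-555-1212'))        # ('800', '555', '1212', None)
print(split_number('80055512121234'))      # ('800', '555', '1212', '1234')
print(split_number('(800)5551212 x1234'))  # None
```

The last call still fails because the string starts with a parenthesis, which this pattern does not allow before the area code.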
Example 7.15. Handling leading characters
>>> phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
>>> phonePattern.search('(800)5551212 ext. 1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
>>> phonePattern.search('work 1-(800) 555.1212 #1234')
>>>
Same as before, except now we're matching \D*, 0 or more non-numeric characters, before the first remembered group (the area code). Note that we're not remembering these
non-numeric characters (they're not in parentheses). If we find them, we'll just skip over them and then start remembering
the area code whenever we get to it.
OK, we can successfully parse the phone number, even with the leading left parenthesis before the area code. (The right parenthesis
after the area code is already handled; it's treated as a non-numeric separator and matched by the \D* after the first remembered group.)
Just a sanity check to make sure we haven't broken anything that used to work. Since the leading characters are entirely
optional, this matches the beginning of the string, then 0 non-numeric characters, then a remembered group of 3 digits (800), then 1 non-numeric character (the hyphen), then a remembered group of 3 digits (555), then 1 non-numeric character (the hyphen), then a remembered group of 4 digits (1212), then 0 non-numeric characters, then a remembered group of 0 digits, then end of string.
This is where regular expressions make me want to gouge my eyes out with a blunt object. Why doesn't this phone number match?
Because there's a 1 before the area code, but we assumed that all the leading characters before the area code were non-numeric characters (\D*). Aargh.
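When a pattern refuses to match, it sometimes helps to test just its first piece and see how far it gets. Here is a rough diagnostic along those lines; the names text, leading, remainder, and anchored are invented for the illustration, and the full pattern is the one from Example 7.15.

```python
import re

text = 'work 1-(800) 555.1212 #1234'

# \D* is greedy but can only consume non-digits, so it stops at the '1'.
leading = re.match(r'\D*', text)
print(repr(leading.group()))   # 'work '

# The pattern needs three digits next, but this is what it finds instead.
remainder = text[leading.end():]
print(repr(remainder[:3]))     # '1-('

# So the anchored pattern from Example 7.15 cannot match this input.
anchored = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
print(anchored.search(text))   # None
```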
Let's back up for a second. So far our regular expressions have all matched from the beginning of the string. But now we
see that there may be an indeterminate amount of stuff at the beginning of the string that we want to ignore. Rather than
trying to match it all just so we can skip over it, let's take a different approach: don't explicitly match the beginning
of the string at all.
Example 7.16. Phone number, wherever I may find ye
>>> phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
>>> phonePattern.search('80055512121234').groups()
('800', '555', '1212', '1234')
Note the lack of ^ in this regular expression. We are not matching the beginning of the string anymore. There's nothing that says you have
to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out
where the input string starts to match, and go from there.
Now we can successfully parse a phone number that includes leading characters and a leading digit, plus any number of any
kind of separators around each part of the phone number.
Sanity check. This still works.
That still works too.
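If you're curious where the engine decided to start matching, the match object will tell you. This is a small sketch using the start() method of match objects; the names unanchored, text, match, and skipped are made up for the illustration, and the pattern is the one from Example 7.16.

```python
import re

# Pattern from Example 7.16: no ^, so the match may begin anywhere.
unanchored = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

text = 'work 1-(800) 555.1212 #1234'
match = unanchored.search(text)

# start() is the index where the overall match begins.
print(match.start())           # 8

# Everything before that index was skipped by the engine, not by our pattern.
skipped = text[:match.start()]
print(repr(skipped))           # 'work 1-('
```

The leading word, the 1, the hyphen, and the left parenthesis were all passed over without our pattern having to describe any of them.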
See how quickly our regular expression got out of control? Take a quick glance at any of the previous iterations. Can you
tell the difference between one and the next? While we still understand our final answer (and it is our final answer; if
you've discovered a case it doesn't handle, I don't want to know about it), let's write it out as a verbose regular expression,
before we forget why we made the choices we made.
Example 7.17. Parsing phone numbers (final version)
>>> phonePattern = re.compile(r'''
# don't match beginning of string, number can start anywhere
(\d{3}) # area code is 3 digits (e.g. '800')
\D* # optional separator is any number of non-digits
(\d{3}) # trunk is 3 digits (e.g. '555')
\D* # optional separator
(\d{4}) # rest of number is 4 digits (e.g. '1212')
\D* # optional separator
(\d*) # extension is optional and can be any number of digits
$ # end of string
''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
Other than being spread out over multiple lines, this is exactly the same regular expression as the last step, so it's no
surprise that it parses the same inputs.
Final sanity check. Yes, this still works. I think we're done.
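To tie it together, here is one way the final pattern might be wrapped in a function and run against every phone number from the list at the start of this section. Treat it as a sketch: the function name parse_phone, the dictionary keys, and the samples list are assumptions made for the illustration, while the pattern is the verbose one from Example 7.17.

```python
import re

phone_pattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)

def parse_phone(text):
    """Return a dict of the four parts, or None if no phone number is found."""
    match = phone_pattern.search(text)
    if match is None:
        return None
    area, trunk, rest, ext = match.groups()
    return {'area': area, 'trunk': trunk, 'rest': rest, 'extension': ext}

samples = [
    '800-555-1212',
    '800 555 1212',
    '800.555.1212',
    '(800) 555-1212',
    '1-800-555-1212',
    '800-555-1212-1234',
    '800-555-1212x1234',
    '800-555-1212 ext. 1234',
    'work 1-(800) 555.1212 #1234',
]

for sample in samples:
    print('%-30s %r' % (sample, parse_phone(sample)))
```

Every sample should come back with area 800, trunk 555, and rest 1212; the first five have an empty extension and the last four have 1234.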