15.2. plural.py, stage 1

So we're looking at words, which at least in English are strings of characters. And we have rules that say we need to find different combinations of characters, and then do different things to them. This sounds like a job for regular expressions.

Example 15.1. plural1.py


import re

def plural(noun):                            
    if re.search('[sxz]$', noun):             1
        return re.sub('$', 'es', noun)        2
    elif re.search('[^aeioudgkprt]h$', noun):
        return re.sub('$', 'es', noun)       
    elif re.search('[^aeiou]y$', noun):      
        return re.sub('y$', 'ies', noun)     
    else:                                    
        return noun + 's'                    
1 OK, this is a regular expression, but it uses a syntax we didn't see in the Regular Expressions chapter. The square brackets mean “match exactly one of these characters”. So [sxz] means “s, or x, or z”, but only one of them. The $ should be familiar; it matches the end of the string. So we're checking to see if noun ends with s, x, or z.
2 This re.sub function is totally new, in the sense that we never covered it in the regular expressions chapter. It performs regular expression-based string substitutions. Let's look at it in more detail.

Example 15.2. Introducing re.sub

>>> import re
>>> re.search('[abc]', 'Mark')   1
<_sre.SRE_Match object at 0x001C1FA8>
>>> re.sub('[abc]', 'o', 'Mark') 2
'Mork'
>>> re.sub('[abc]', 'o', 'rock') 3
'rook'
>>> re.sub('[abc]', 'o', 'caps')  4
'oops'
1 Does the string Mark contain a, b, or c? Yes, it contains a.
2 OK, now find a, b, or c, and replace it with o. Mark becomes Mork.
3 The same function turns rock into rook.
4 You might think this would turn caps into oaps, but it doesn't. re.sub replaces all of the matches, not just the first one. So this regular expression turns caps into oops, because both the c and the a get turned into o.
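
To make the replace-all behavior concrete, here is a small sketch. It uses count, a standard optional argument of re.sub that the examples above don't use. Passing count=1 limits the substitution to the first match, which gives the oaps result you might have expected.

```python
import re

# Default behavior: every match of [abc] is replaced.
replace_all = re.sub('[abc]', 'o', 'caps')

# count=1 stops after the first match (the c), leaving the a alone.
replace_first = re.sub('[abc]', 'o', 'caps', count=1)

print(replace_all)    # oops
print(replace_first)  # oaps
```

The plural function never needs count, because each of its patterns is anchored with $ and can match only once.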

Example 15.3. Back to plural1.py


import re

def plural(noun):                            
    if re.search('[sxz]$', noun):            
        return re.sub('$', 'es', noun)        1
    elif re.search('[^aeioudgkprt]h$', noun): 2
        return re.sub('$', 'es', noun)       
    elif re.search('[^aeiou]y$', noun):       3
        return re.sub('y$', 'ies', noun)     
    else:                                    
        return noun + 's'                    
1 Back to our plural function. What are we doing? We're replacing the end of string with es. In other words, adding es to the string. We could accomplish the same thing with string concatenation, for example noun + 'es', but I'm using regular expressions for everything for the sake of consistency, and for reasons that will become clear later in the chapter.
2 Look closely; this is another new variation. The ^ as the first character inside the square brackets means something special: negation. [^abc] means “any single character except a, b, or c”. So [^aeioudgkprt] means any character except a, e, i, o, u, d, g, k, p, r, or t. Then that character needs to be followed by h, followed by end of string. We're looking for words that end in H where the H can be heard.
3 Same pattern here: match words that end in Y, where the character before the Y is not a, e, i, o, or u. We're looking for words that end in a Y that sounds like I.
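
As a quick check of the rules, the sketch below repeats the function from the listing (minus the callout numbers) and runs it on a few nouns. The sample words are chosen here for illustration, one or two per branch; they are not part of the original example.

```python
import re

def plural(noun):
    if re.search('[sxz]$', noun):
        return re.sub('$', 'es', noun)
    elif re.search('[^aeioudgkprt]h$', noun):
        return re.sub('$', 'es', noun)
    elif re.search('[^aeiou]y$', noun):
        return re.sub('y$', 'ies', noun)
    else:
        return noun + 's'

# Illustrative nouns (an assumption of this sketch), grouped by rule.
samples = [
    'fax', 'waltz',      # end in s, x, or z: add es
    'coach', 'rash',     # h preceded by a letter outside the excluded set: add es
    'cheetah',           # h preceded by a (in the excluded set): plain s
    'vacancy',           # y preceded by a consonant: y becomes ies
    'day', 'boy',        # y preceded by a vowel: plain s
    'dog',               # nothing special: plain s
]

for noun in samples:
    print(noun, '->', plural(noun))
```

For plain words like these, re.sub('$', 'es', noun) and noun + 'es' give the same result, which is why the two spellings are interchangeable in the first two branches.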

Example 15.4. More on negation regular expressions

>>> import re
>>> re.search('[^aeiou]y$', 'vacancy') 1
<_sre.SRE_Match object at 0x001C1FA8>
>>> re.search('[^aeiou]y$', 'boy')     2
>>> 
>>> re.search('[^aeiou]y$', 'day')
>>> 
>>> re.search('[^aeiou]y$', 'pita')    3
>>> 
1 vacancy matches this regular expression, because it ends in cy, and c is not a, e, i, o, or u.
2 boy does not match, because it ends in oy, and we specifically said that the character before the y could not be o. day does not match, because it ends in ay.
3 pita does not match, because it does not end in y.
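
If you want to see exactly which characters matched, the match object returned by re.search has a group() method that returns the matched text. Here is a minimal sketch using the same words. The helper name matched_tail and the one-letter string y are additions made here for illustration. The y case shows that a negated class still has to match some character, so a bare y fails.

```python
import re

def matched_tail(word):
    """Return the text matched by [^aeiou]y$, or None if there is no match."""
    match = re.search('[^aeiou]y$', word)
    # re.search returns None when the pattern is not found.
    return match.group() if match else None

for word in ['vacancy', 'boy', 'day', 'pita', 'y']:
    print(word, matched_tail(word))
```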

Example 15.5. More on re.sub

>>> re.sub('y$', 'ies', 'vacancy')              1
'vacancies'
>>> re.sub('y$', 'ies', 'agency')
'agencies'
>>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy') 2
'vacancies'
1 This regular expression turns vacancy into vacancies and agency into agencies, which is what we wanted. Note that it would also turn boy into boies, but that will never happen in our function because we did that re.search first to find out whether we should do this re.sub.
2 Just in passing, I want to point out that it is possible to combine these two regular expressions (one to find out if the rule applies, and another to actually apply it) into a single regular expression. Here's what that would look like. Most of it should look familiar: we're using a remembered group, which we learned in Case study: parsing phone numbers, to remember the character before the y. Then in the substitution string, we use a new syntax, \1, which means “hey, that first group you remembered? put it here”. In this case, we remember the c before the y, and then when we do the substitution, we substitute c in place of c, and ies in place of y. (If you have more than one remembered group, you can use \2 and \3 and so on.)
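
Here is a small sketch of the combined approach in action. The helper name pluralize_y is hypothetical, introduced only for illustration. It relies on the fact that re.sub returns the string unchanged when the pattern doesn't match, so boy comes back as boy rather than boies. The last lines show \2 with a two-group pattern, and \g<1>, an alternative spelling of \1 that is handy when a digit follows the group reference.

```python
import re

def pluralize_y(noun):
    # One regular expression both tests the rule and applies it.
    # If the pattern does not match, re.sub returns noun unchanged.
    return re.sub('([^aeiou])y$', r'\1ies', noun)

print(pluralize_y('vacancy'))  # vacancies
print(pluralize_y('boy'))      # boy (no match, so no change)

# Two remembered groups: \2 and \1 swap the two words.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'regular expression')
print(swapped)                 # expression regular

# \g<1> means the same thing as \1.
same = re.sub('([^aeiou])y$', r'\g<1>ies', 'agency')
print(same)                    # agencies
```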

Regular expression substitutions are extremely powerful, and the \1 syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder to read, and it doesn't directly map to the way we first described the pluralizing rules. We originally laid out rules like “if the word ends in S, X, or Z, then add ES”. And if you look at this function, we have two lines of code that say “if the word ends in S, X, or Z, then add ES”. It doesn't get much more direct than that.