Regular Expression
In filename patterns, the star means “match anything,” and a question mark means “match any one character.”
Full regular expressions are composed of two types of characters. The special characters (like the * from the filename analogy) are called metacharacters, while the rest are called literal, or normal text characters.
It might help to consider regular expressions as their own language, with literal text acting as the words and metacharacters as the grammar. The words are combined with grammar according to a set of rules to create an expression that communicates an idea.
Start and End of the Line
Probably the easiest metacharacters to understand are ^ (caret) and $ (dollar), which represent the start and end, respectively, of the line of text as it is being checked. As we’ve seen, the regular expression cat finds c•a•t anywhere on the line, but ^cat matches only if the c•a•t is at the beginning of the line; the ^ is used to effectively anchor the match (of the rest of the regular expression) to the start of the line. Similarly, cat$ finds c•a•t only at the end of the line, such as a line ending with scat.
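The two anchors can be tried out with a minimal sketch. It uses Python's re module as a stand-in for egrep (an assumption; the text itself works with egrep), and the sample lines are made up for illustration.

```python
import re

# Hypothetical sample lines, one string per "line of text".
lines = ["cat food", "a stray cat", "scat", "concatenate"]

# Unanchored: c, a, t may appear anywhere on the line.
anywhere = [s for s in lines if re.search(r"cat", s)]

# ^cat: the c, a, t must sit at the start of the line.
at_start = [s for s in lines if re.search(r"^cat", s)]

# cat$: the c, a, t must sit at the end of the line.
at_end = [s for s in lines if re.search(r"cat$", s)]

print(anywhere)  # all four lines
print(at_start)  # ['cat food']
print(at_end)    # ['a stray cat', 'scat']
```

Note that the anchors match a position, not a character, so ^cat still matches exactly three characters.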
Matching any one of several characters
Let’s say you want to search for “grey,” but also want to find it if it were spelled “gray.” The regular-expression construct [⋯], usually called a character class, lets you list the characters you want to allow at that point in the match. While e matches just an e, and a matches just an a, the regular expression [ea] matches either. So, then, consider gr[ea]y: this means to find “g, followed by r, followed by either an e or an a, all followed by y.”
Within a character class, the character-class metacharacter ‘-’ (dash) indicates a range of characters: for example, [1-6] is the same as [123456].
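As a rough illustration of a character class and a range, here is a short sketch in Python's re module (used here as a stand-in for egrep). The word lists and the H[1-6] pattern are illustrative assumptions, not examples taken from the text above.

```python
import re

# gr[ea]y: g, then r, then either e or a, then y.
grey_or_gray = re.compile(r"gr[ea]y")
words = ["grey", "gray", "groy", "greay"]
spelled = [w for w in words if grey_or_gray.search(w)]

# A dash inside a class builds a range: [1-6] behaves like [123456].
ranged = re.compile(r"H[1-6]")
listed = re.compile(r"H[123456]")
tags = ["H1", "H3", "H6", "H7", "H0"]
by_range = [t for t in tags if ranged.search(t)]
by_list = [t for t in tags if listed.search(t)]

print(spelled)              # ['grey', 'gray']
print(by_range)             # ['H1', 'H3', 'H6']
print(by_range == by_list)  # True
```

The class stands for exactly one character, which is why greay is not matched.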
Negated character classes
[^1-6] matches a character that’s not 1 through 6. The leading ^ in the class “negates” the list, so rather than listing the characters you want to include in the class, you list the characters you don’t want to be included.
You might have noticed that the ^ used here is the same as the start-of-line caret introduced on page 8. The character is the same, but the meaning is completely different. Just as the English word “wind” can mean different things depending on the context (sometimes a strong breeze, sometimes what you do to a clock), so can a metacharacter. We’ve already seen one example, the range-building dash. It is valid only inside a character class (and at that, only when not first inside the class). ^ is a line anchor outside a class, but a class metacharacter inside a class (but only when it is immediately after the class’s opening bracket; otherwise, it’s not special inside a class). Don’t fear; these are the most complex special cases. Others we’ll see later aren’t so bad.
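The three roles of the caret can be checked side by side. This is a sketch using Python's re module in place of egrep; the test strings are invented for the purpose.

```python
import re

# Leading ^ inside a class negates it: any one character that is not 1 through 6.
not_1_to_6 = re.compile(r"[^1-6]")
negated_hits = not_1_to_6.findall("H7a3")

# A caret that is not first inside the class is just an ordinary character.
digit_or_caret = re.compile(r"[1-6^]")
literal_hits = digit_or_caret.findall("2^9")

# Outside a class, the caret anchors the match to the start of the line.
starts_with_7 = bool(re.search(r"^7", "72"))
seven_later = bool(re.search(r"^7", "27"))

print(negated_hits)                # ['H', '7', 'a']
print(literal_hits)                # ['2', '^']
print(starts_with_7, seven_later)  # True False
```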
Matching Any Character with Dot
The metacharacter . (usually called dot or point) is a shorthand for a character class that matches any character. It can be convenient when you want to have an “any character here” placeholder in your expression. For example, if you want to search for a date such as 03/19/76, 03-19-76, or even 03.19.76, you could go to the trouble to construct a regular expression that uses character classes to explicitly allow ‘/’, ‘-’, or ‘.’ between each number, such as 03[-./]19[-./]76. However, you might also try simply using 03.19.76.
Quite a few things are going on with this example that might be unclear at first. In 03[-./]19[-./]76, the dots are not metacharacters because they are within a character class. (Remember, the list of metacharacters and their meanings is different inside and outside of character classes.) The dashes are also not class metacharacters in this case because each is the first thing after [ or [^. Had they not been first, as with [.-/], they would be the class range metacharacter, which would be a mistake in this situation.
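The difference between the two date expressions, and the pitfall of a misplaced dash, can be seen in a small sketch. Python's re module stands in for egrep here, and the last two sample strings are made-up non-dates added to show what the dot version also accepts.

```python
import re

samples = ["03/19/76", "03-19-76", "03.19.76", "03x19y76", "03219876"]

# Explicit separators: only '-', '.', or '/' may sit between the numbers.
explicit = re.compile(r"03[-./]19[-./]76")
# Dot as a placeholder: any character at all may sit between the numbers.
with_dot = re.compile(r"03.19.76")

explicit_hits = [s for s in samples if explicit.search(s)]
dot_hits = [s for s in samples if with_dot.search(s)]

# With the dash in the middle, [.-/] is a range from '.' to '/',
# so the dash itself is no longer one of the listed characters.
range_class = re.compile(r"[.-/]")
range_hits = [c for c in "-./" if range_class.fullmatch(c)]

print(explicit_hits)  # ['03/19/76', '03-19-76', '03.19.76']
print(dot_hits)       # all five samples
print(range_hits)     # ['.', '/']
```

The dot version is shorter to write but less precise, so whether it is acceptable depends on what else might appear in the text being searched.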
Matching any one of several subexpressions
A very convenient metacharacter is |, which means “or.” It allows you to combine multiple expressions into a single expression that matches any of the individual ones. For example, Bob and Robert are separate expressions, but Bob|Robert is one expression that matches either. When combined this way, the subexpressions are called alternatives.
Looking back to our gr[ea]y example, it is interesting to realize that it can be written as grey|gray, and even gr(a|e)y. The latter case uses parentheses to constrain the alternation. (For the record, parentheses are metacharacters too.) Note that something like gr[a|e]y is not what we want: within a class, the ‘|’ character is just a normal character, like a and e.
With gr(a|e)y, the parentheses are required because without them, gra|ey means “gra or ey,” which is not what we want here. Alternation reaches far, but not beyond parentheses. Another example is (First|1st)•[Ss]treet.† Actually, since both First and 1st end with st, the combination can be shortened to (Fir|1)st•[Ss]treet. That’s not necessarily quite as easy to read, but be sure to understand that (First|1st) and (Fir|1)st effectively mean the same thing.
† Recall from the typographical conventions on page xxii that “•” is how I sometimes show a space character so it can be seen easily.
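How far alternation reaches can be verified with a brief sketch. It assumes Python's re module in place of egrep; the helper full_matches and the word lists are hypothetical names introduced only for this illustration.

```python
import re

words = ["grey", "gray", "gra", "ey", "gr|y"]

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# Parentheses constrain the alternation to the middle letter.
constrained = full_matches(r"gr(a|e)y", words)
# Without parentheses, the alternatives are "gra" and "ey".
unconstrained = full_matches(r"gra|ey", words)
# Inside a class, | is an ordinary character.
in_class = full_matches(r"gr[a|e]y", words)

streets = ["First Street", "1st street", "Fist Street"]
long_form = full_matches(r"(First|1st) [Ss]treet", streets)
short_form = full_matches(r"(Fir|1)st [Ss]treet", streets)

print(constrained)    # ['grey', 'gray']
print(unconstrained)  # ['gra', 'ey']
print(in_class)       # ['grey', 'gray', 'gr|y']
print(long_form == short_form)  # True
```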
Here’s an example involving an alternate spelling of my name. Compare and contrast the following three expressions, which are all effectively the same:
Jeffrey|Jeffery
Jeff(rey|ery)
Jeff(re|er)y
To have them match the British spellings as well, they could be:
(Geoff|Jeff)(rey|ery)
(Geo|Je)ff(rey|ery)
(Geo|Je)ff(re|er)y
Finally, note that these three match effectively the same as the longer (but simpler) Jeffrey|Geoffery|Jeffery|Geoffrey. They’re all different ways to specify the same desired matches.
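One way to convince yourself that these expressions really are interchangeable is to run them against the same list of names. The sketch below uses Python's re module as a stand-in for egrep; the candidate list, including the deliberately wrong spellings, is an assumption made for the test.

```python
import re

three_forms = [
    r"(Geoff|Jeff)(rey|ery)",
    r"(Geo|Je)ff(rey|ery)",
    r"(Geo|Je)ff(re|er)y",
]
long_form = r"Jeffrey|Geoffery|Jeffery|Geoffrey"

# The first four are the desired spellings; the rest should be rejected.
candidates = ["Jeffrey", "Jeffery", "Geoffrey", "Geoffery",
              "Jeffry", "Geoff", "Jefferey"]

# For each expression, collect the names it matches in full.
accepted = {
    pattern: [name for name in candidates if re.fullmatch(pattern, name)]
    for pattern in three_forms + [long_form]
}

for pattern, names in accepted.items():
    print(pattern, "->", names)
```

Each of the four expressions accepts exactly the four desired spellings and nothing else from the list.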
Although the gr[ea]y versus gr(a|e)y examples might blur the distinction, be careful not to confuse the concept of alternation with that of a character class. A character class can match just a single character in the target text. With alternation, since each alternative can be a full-fledged regular expression in and of itself, each alternative can match an arbitrary amount of text. Character classes are almost like their own special mini-language (with their own ideas about metacharacters, for example), while alternation is part of the “main” regular expression language. You’ll find both to be extremely useful.
Also, take care when using caret or dollar in an expression that has alternation. Compare ^From|Subject|Date:• with ^(From|Subject|Date):•. Both appear similar to our earlier email example, but what each matches (and therefore how useful it is) differs greatly. The first is composed of three alternatives, so it matches “^From or Subject or Date:•,” which is not particularly useful. We want the leading caret and trailing :• to apply to each alternative. We can accomplish this by using parentheses to “constrain” the alternation:
^(From|Subject|Date):•
The alternation is constrained by the parentheses, so literally, this regex means “match the start of the line, then one of From, Subject, or Date, and then match :•.” Effectively, it matches:
1) start-of-line, followed by F•r•o•m, followed by ‘:•’
or 2) start-of-line, followed by S•u•b•j•e•c•t, followed by ‘:•’
or 3) start-of-line, followed by D•a•t•e, followed by ‘:•’
Putting it less literally, it matches lines beginning with ‘From:•’, ‘Subject:•’, or ‘Date:•’, which is quite useful for listing the messages in an email file.
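The practical difference between the two expressions shows up on header-like lines. This sketch uses Python's re module instead of egrep, and the mailbox lines are invented for illustration.

```python
import re

# Hypothetical lines from an email file.
mailbox = [
    "From: elvis@example.com",
    "Subject: be seeing you",
    "Date: Thu, 31 Oct 2002",
    "Reply-To: the Subject of the mail",
    "X-Date: none",
    "Hello From: afar",
]

# Three separate alternatives: "^From", "Subject", and "Date: ".
loose = re.compile(r"^From|Subject|Date: ")
# Caret and trailing ": " apply to every alternative.
strict = re.compile(r"^(From|Subject|Date): ")

loose_hits = [line for line in mailbox if loose.search(line)]
strict_hits = [line for line in mailbox if strict.search(line)]

print(loose_hits)   # the first five lines
print(strict_hits)  # only the three real header lines
```

The loose version picks up "Reply-To: the Subject of the mail" and "X-Date: none" because only its first alternative is anchored.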
Ignoring Differences in Capitalization
This email header example provides a good opportunity to introduce the concept of a case-insensitive match. The field types in an email header usually appear with leading capitalization, such as “Subject” and “From,” but the email standard actually allows mixed capitalization, so things like “DATE” and “from” are also allowed. Unfortunately, the regular expression in the previous section doesn’t match those.
One approach is to replace From with [Ff][Rr][Oo][Mm] to match any form of “from,” but this is quite cumbersome, to say the least. Fortunately, there is a way to tell egrep to ignore case when doing comparisons, i.e., to perform the match in a case-insensitive manner in which capitalization differences are simply ignored. It is not a part of the regular-expression language, but is a related useful feature many tools provide. egrep’s command-line option “-i” tells it to do a case-insensitive match. Place -i on the command line before the regular expression:
% egrep -i '^(From|Subject|Date): ' mailbox
This brings up all the lines we matched before, but also includes lines such as:
SUBJECT: MAKE MONEY FAST
I find myself using the -i option quite frequently (perhaps related to the footnote on page 12!) so I recommend keeping it in mind. We’ll see other convenient support features like this in later chapters.
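Many tools offer an equivalent of egrep's -i. As one example, Python's re module (used here as a stand-in, not as the tool the text describes) takes the re.IGNORECASE flag; the sample lines below are assumptions made for the sketch.

```python
import re

lines = [
    "Subject: hello",
    "SUBJECT: MAKE MONEY FAST",
    "from: someone",
    "Received: by host",
]

# Case-sensitive, as in the previous section.
header = re.compile(r"^(From|Subject|Date): ")
# Same expression, with capitalization differences ignored.
header_any_case = re.compile(r"^(From|Subject|Date): ", re.IGNORECASE)

sensitive_hits = [s for s in lines if header.search(s)]
insensitive_hits = [s for s in lines if header_any_case.search(s)]

# The cumbersome alternative: spell out both cases for every letter.
spelled_out = bool(re.search(r"^[Ff][Rr][Oo][Mm]: ", "fRoM: someone"))

print(sensitive_hits)    # ['Subject: hello']
print(insensitive_hits)  # the first three lines
print(spelled_out)       # True
```

As with -i, the flag sits outside the expression itself, so the same regular expression can be reused with or without it.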
Word Boundaries
A common problem is that a regular expression that matches the word you want can often also match where the “word” is embedded within a larger word. I mentioned this briefly in the cat, gray, and Smith examples. It turns out, though, that some versions of egrep offer limited support for word recognition: namely the ability to match the boundary of a word (where a word begins or ends).
You can use the (perhaps odd looking) metasequences \< and \> if your version happens to support them (not all versions of egrep do). You can think of them as word-based versions of ^ and $ that match the position at the start and end of a word, respectively. Like the line anchors caret and dollar, they anchor other parts of the regular expression but don’t actually consume any characters during a match. The expression \<cat\>, for example, matches c•a•t only when a word starts immediately before it and ends immediately after it.
Note that < and > alone are not metacharacters; when combined with a backslash, the sequences become special. This is why I called them “metasequences.” It’s their special interpretation that’s important, not the number of characters, so for the most part I use these two meta-words interchangeably.
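Support for word boundaries varies from tool to tool. Python's re module, used below as a stand-in, does not provide \< and \>; it offers \b instead, which matches at either edge of a word. The sketch therefore shows the idea rather than egrep's exact syntax, and the sample lines are made up.

```python
import re

lines = ["the cat sat", "concatenate", "scat", "cat", "cat's toy", "cat_food"]

# \b matches the position at a word edge without consuming a character.
# Python counts letters, digits, and underscore as word characters.
whole_word = re.compile(r"\bcat\b")

word_hits = [s for s in lines if whole_word.search(s)]

# The match itself is still only the three letters.
span = whole_word.search("the cat sat").span()

print(word_hits)  # ['the cat sat', 'cat', "cat's toy"]
print(span)       # (4, 7)
```

What counts as a "word" is tool specific; here the underscore in cat_food is treated as part of the word, so that line is not matched.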
Summary of Metacharacters Seen So Far
Metacharacter   Name                      Matches
.               dot                       any one character
[⋯]             character class           any character listed
[^⋯]            negated character class   any character not listed
^               caret                     the position at the start of the line
$               dollar                    the position at the end of the line
\<              backslash less-than       the position at the start of a word (not supported by all versions of egrep)
\>              backslash greater-than    the position at the end of a word (not supported by all versions of egrep)
|               or; bar                   matches either expression it separates
(⋯)             parentheses               used to limit scope of |, plus additional uses yet to be discussed