Regular Expressions

You could call this another installment in my series on the UNIX command line interface. ``Regular expressions'' are a way to identify specific portions of text using symbolic patterns written with only the ASCII character set. Identifying these portions of text enables you to perform manual or automated editing on text files easily, whether they are UNIX configuration files, data files, or any other text files.

One of the most frequently used commands that uses regular expressions is ``grep,'' the ubiquitous UNIX search utility. GNU grep also answers to egrep and fgrep, which are the same as grep -E and grep -F, respectively. Recent GNU releases warn that those two names are obsolescent, so grep -E and grep -F are the safer spellings. (See man grep for more information.) Other tools include sed, vi, Emacs, gawk, Perl, Bash, Tcl, and many others.

The basic types of expressions provided for are alternation, concatenation, grouping, closure, and some advanced expressions. Then there are about umpteen million ways to use each of these, with their added syntax features as well. Finally, there are some oddities and idiosyncrasies worth noting.

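Alternation. When you want to match either one pattern or another pattern (or any of several patterns), you use an alternation expression. The alternation symbol for UNIX regular expressions is |. In grep's default (basic) syntax, GNU grep wants it escaped as \|; with grep -E it is written bare. Example: grep -l '^#!/bin/sh\|^#!/usr/bin/perl' * will list all Bourne shell scripts and Perl scripts in the current directory. (Note the lowercase -l; the uppercase -L lists the files that do not match.) Or, for instance, grep -e 'fail\|error' /var/log/ppp.log will find every line in your ppp log that mentions a failure or an error. There is no need to list both fail and failed, because fail already matches inside failed.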
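Here is a small sketch of alternation you can try without touching your real files. The temporary directory and the three file names in it are made up for the illustration, and I have used grep -E so the | needs no backslash.

```shell
#!/bin/sh
# Hypothetical sandbox: three throwaway files in a temporary directory.
dir=$(mktemp -d)
printf '#!/bin/sh\necho hello\n' > "$dir/backup"
printf '#!/usr/bin/perl\nprint 1;\n' > "$dir/report"
printf 'just some notes about /bin/sh\n' > "$dir/notes"

# -l lists the files that contain a match; -E selects extended syntax,
# where a bare | means "or". The ^ anchors keep the notes file out.
scripts=$(cd "$dir" && grep -lE '^#!/bin/sh|^#!/usr/bin/perl' * | sort)
echo "$scripts"

rm -rf "$dir"
```

The notes file mentions /bin/sh but does not start with the #! line, so only the two scripts should be listed.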

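Concatenation. Matching a single literal string is the simplest expression there is; matching that string followed directly by another expression is a concatenation. Example: locate log | grep -E 'ppp\.log$' will match /var/log/ppp.log and nothing longer, but locate log | grep -E 'ppp\.log(\.[0-9]+)?(\.gz)?$' will match ppp.log.0 and ppp.log.1 also, as well as any gzipped logfiles that have been rotated, such as ppp.log.3.gz. This latter expression is an example of concatenation: a literal string followed by two optional pieces. Remember that grep matches anywhere on a line unless you anchor the pattern with ^ or $, and that a regular expression is not a shell wildcard: ppp*log means pp followed by zero or more p's and then log, so it would not match ppp.log at all.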
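To see how anchoring and the escaped dot change what a concatenated pattern matches, here is a minimal sketch. The list of paths is invented and simply stands in for whatever locate would print on your system.

```shell
#!/bin/sh
# Hypothetical path list standing in for the output of `locate log`.
paths='/var/log/ppp.log
/var/log/ppp.log.0
/var/log/ppp.log.3.gz
/var/log/pppXlog
/var/log/syslog'

# Anchored with $: the literal string ppp.log must end the line.
current=$(printf '%s\n' "$paths" | grep -E 'ppp\.log$')

# Two optional pieces concatenated after the literal string, so that
# rotated (.0, .1, ...) and gzipped (.gz) logs match as well.
rotated=$(printf '%s\n' "$paths" | grep -E 'ppp\.log(\.[0-9]+)?(\.gz)?$')

# Without the backslash the dot matches any character, so pppXlog slips in.
loose=$(printf '%s\n' "$paths" | grep -E 'ppp.log$')

printf 'current:\n%s\nrotated:\n%s\nloose:\n%s\n' "$current" "$rotated" "$loose"
```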

Grouping. Often it's convenient to mark off part of a pattern as a group so you can reuse that part somehow. Groups are designated with parentheses in regular expressions. Example: :%s/Comedy of \(Errors\)/Tragic \1/g This vi editor command (the Vim editor, actually) searches through a document for ``Comedy of Errors'' and replaces all instances of it with ``Tragic Errors'' instead. Grouping done in this way can be very useful.

Another more practical usage might be: grep '[^:]$' /etc/group | sed 's/^\([^:]*\):\([^:]*\):\([^:]*\):\(.*\)$/\1-->\4/' | sort | more This is called ``backreferencing.'' Each of the parenthetical expressions is a group, and each group is remembered by sed and can be called by using its positional variable. The first group is called by \1, the second by \2, and so on. This command looks through the /etc/group file and keeps only the lines for groups that have members (those lines don't end with a colon), then pipes those lines to sed, which displays the first and fourth groups, separated by an arrow. The ``\1'' and ``\4'' are backreferences to the first and fourth parenthetical groups. Quote the grep pattern so the shell does not try to expand the square brackets as a filename wildcard.
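If you would rather experiment on sample data than on your own /etc/group, a sketch along these lines should do. The three group entries are invented; the only assumption is the usual name:password:GID:members layout.

```shell
#!/bin/sh
# Hypothetical sample in /etc/group format: name:password:GID:members
sample='wheel:x:10:root,alice
audio:x:63:
staff:x:50:bob'

# Keep the lines that do not end in a colon (groups with members), then
# print field 1 and field 4 through the backreferences \1 and \4.
members=$(printf '%s\n' "$sample" |
  grep '[^:]$' |
  sed 's/^\([^:]*\):\([^:]*\):\([^:]*\):\(.*\)$/\1-->\4/' |
  sort)
echo "$members"
```

Using [^:]* rather than .* for each field keeps a group from swallowing a colon that belongs to its neighbor.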

Closure. We've already seen the principle of closure in action above. Closure symbols are the following operators: ? + and * Each of these metacharacters describes how many times the item immediately before it may occur. The question mark means ``zero or one time,'' the plus sign means ``one or more times,'' and the asterisk means ``zero or more times.'' So, ab+ means ``a'' with one or more ``b's'' after it. (ab)+ means one or more instances of ab together. abcd?e would match abce as well as abcde, but not abcd9e: unlike the shell's ? wildcard, the regular expression ? does not stand for a character of its own. Likewise, abcd* means abc followed by zero or more d's. To say ``abcd followed by anything,'' write abcd.* instead, because the dot is the metacharacter that matches any single character. With grep, ? and + require the extended syntax (grep -E); GNU grep also accepts \? and \+ in its basic syntax.
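The three closure operators are easy to check against a handful of test words. In this sketch the word list and the helper function matches are my own inventions; grep -x makes the pattern account for the whole line, so nothing matches by accident as a substring.

```shell
#!/bin/sh
# Hypothetical test words, one per line.
words='abce
abcde
abcd9e
abc
abcdddd
ab
abab'

# matches PATTERN: print, on one line, the words that PATTERN matches
# in their entirety (-x) using extended syntax (-E).
matches() { printf '%s\n' "$words" | grep -xE "$1" | paste -s -d ' ' -; }

optional=$(matches 'abcd?e')   # the d may appear zero times or once
star=$(matches 'abcd*')        # abc followed by zero or more d
plus=$(matches 'ab+')          # a followed by one or more b
grouped=$(matches '(ab)+')     # one or more repetitions of ab

printf '%s\n' "$optional" "$star" "$plus" "$grouped"
```

Note that abcd9e turns up in none of the four results, which is the difference between a regular expression and a shell wildcard.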

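Advanced Expressions. Most metacharacters mentioned above do not retain their special meanings inside the square brackets [ and ]. A few characters are special there because of where they sit: ^ negates the set when it comes first, - between two characters denotes a range, and ] closes the set unless it comes first. The $ is just an ordinary character inside brackets. So, if you wanted to search for a bunch of metacharacters, you couldn't use [^&*$@], which matches any character except those four, but you could use [@$^&*]. You just have to pay close attention to how the regular expression gets interpreted, namely, one character at a time. It's easy to be ambiguous if you're not careful.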
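The position of the ^ inside the brackets makes all the difference, as this minimal sketch suggests. The six sample lines are made up; each of the first five contains exactly one punctuation character.

```shell
#!/bin/sh
# Hypothetical sample lines.
sample='a@b
c$d
e^f
g&h
i*j
plain'

# Any one of the listed characters; ^ is literal because it is not first.
any=$(printf '%s\n' "$sample" | grep -c '[@$^&*]')

# ^ placed first negates the set. With -x, count the lines made up
# entirely of characters other than & * $ and @.
none=$(printf '%s\n' "$sample" | grep -cx '[^&*$@]*')

echo "lines with a listed character: $any"
echo "lines with none of & * \$ @:    $none"
```

The second count includes the line e^f, because in that pattern the ^ is the negation sign and not a member of the set.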

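Another advanced expression is the counting expression, which can apply to an individual character or to a group. We use the curly braces for these: { and }. We can specify how many instances of a pattern we wish to match. For example, the /etc/group fields that the previous sed command spelled out one by one can be skipped with a count: sed -e 's/^\([^:]*\):\([^:]*:\)\{2\}\(.*\)$/\1-->\3/' Here \([^:]*:\)\{2\} consumes the password and GID fields, so the member list becomes the third group. A repeated group remembers only its last repetition, which is why the group name still needs parentheses of its own. Notice that the curly braces are escaped here.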
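As a sketch of a counting expression applied to a group, the following runs one made-up /etc/group style line through sed twice, once in the basic syntax and once with -E. I am assuming a sed that understands -E, which GNU and BSD sed both do.

```shell
#!/bin/sh
# Hypothetical entry in name:password:GID:members format.
line='wheel:x:10:root,alice'

# Basic syntax (the sed default): parentheses and braces take a backslash.
# \([^:]*:\)\{2\} skips exactly two colon-terminated fields.
basic=$(printf '%s\n' "$line" |
  sed 's/^\([^:]*\):\([^:]*:\)\{2\}\(.*\)$/\1-->\3/')

# Extended syntax (-E): the same expression with no backslashes.
extended=$(printf '%s\n' "$line" |
  sed -E 's/^([^:]*):([^:]*:){2}(.*)$/\1-->\3/')

echo "$basic"
echo "$extended"
```

Both commands should print the same thing; only the escaping differs.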

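Another example: perhaps you discovered a peculiar mistake that had been duplicated throughout some documents, such that only a computer could create or fix it. You might type grep -e '^M\{4\}' to find four consecutive carriage returns. Here ^M stands for a single literal carriage return character, typed in most shells as Ctrl-V followed by Ctrl-M, and not for a caret followed by the letter M. Line endings appear differently in various environments, so you might have to use a different search key for yours. Keep in mind that grep reads its input one line at a time, so a pattern can never span a line feed. Notice that each curly brace is ``escaped.'' The single quotes already protect the pattern from the shell; the backslashes are there because grep's basic syntax treats a bare brace as an ordinary character. With grep -E you would write {4} instead.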

But perhaps several consecutive line returns are correct in your documentation style guide under certain circumstances. You can further refine your search by adding more counting expressions (also called ranges or intervals). For example: grep -e '^M\{3,5\}' will match a run of at least three consecutive carriage returns and at most five. So using a pair of values inside the curly braces, you can specify a range of instances for your expression to match. Because grep matches anywhere on the line, a longer run still contains a shorter one and will be reported too, unless you also pin down what surrounds the run.
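Typing a literal carriage return is awkward in a script, so this sketch builds one with printf and keeps it in a variable. The four sample lines, the variable cr, and the helper make_sample are all hypothetical; the lines end in runs of 2, 3, 5, and 6 carriage returns.

```shell
#!/bin/sh
# A literal carriage return, the character usually displayed as ^M.
cr=$(printf '\r')

# Hypothetical input: a label followed by 2, 3, 5, and 6 carriage returns.
make_sample() {
  printf 'two\r\r\nthree\r\r\r\nfive\r\r\r\r\r\nsix\r\r\r\r\r\r\n'
}

# Exactly-four interval, unanchored: any line holding four in a row.
four_plus=$(make_sample | grep -c "${cr}\{4\}")

# Three-to-five interval, unanchored: the run of six still contains one.
three_up=$(make_sample | grep -c "${cr}\{3,5\}")

# Pinned between a non-CR character and the end of the line, the
# interval now rejects runs that are too long as well as too short.
bounded=$(make_sample | grep -c "[^${cr}]${cr}\{3,5\}\$")

echo "$four_plus $three_up $bounded"
```

The difference between the second and third counts is the line with six carriage returns, which shows why an interval usually wants something on either side of it.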

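Of course, you can use multiple expression combinations, too. For example, if your style guide specified that acronyms must be defined on the first instance, this expression could serve as a rough first pass at finding those that were not: grep -e '\<[A-Z]\{2,\}\> [^(]'. It flags any word of two or more capital letters that is followed by a space and something other than an opening parenthesis. (The \< and \> word anchors are a GNU extension that many other tools also accept.) You can get much more sophisticated than this in your document checks, but you'll probably want to collect your commands into a script.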
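A rough acronym check can be sketched as follows. The three sentences are invented, and I have spelled the word boundary out as (^|[^A-Za-z]) so the pattern does not depend on any extension; LC_ALL=C is there so that [A-Z] means plain ASCII capitals regardless of your locale.

```shell
#!/bin/sh
# Hypothetical document text, one sentence per line.
text='The CPU (central processing unit) runs hot.
We upgraded the RAM last week.
A PDF (Portable Document Format) file.'

# Two or more capitals, not preceded by a letter, followed by a space
# and then anything except an opening parenthesis.
flagged=$(printf '%s\n' "$text" |
  LC_ALL=C grep -E '(^|[^A-Za-z])[A-Z]{2,} [^(]')

echo "$flagged"
```

Only the RAM sentence should come back, since the other two acronyms are followed by a parenthetical definition. An acronym at the very end of a line or before punctuation would escape this pattern, so treat it as a starting point.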

Oddities. You should be aware that in the UNIX world, not all tools obey the same rules when it comes to regular expressions. What counts as ``regular'' simply differs from one tool to the next. In particular, grouping syntax and the rules for when to escape certain metacharacters are not consistent across all regular expression tools.

Egrep, for example, is implemented differently on several different platforms. Some versions accept the escaped operators of sed and vi, but others do not. Also, in some older tools that use the basic notation, the asterisk closure operator will not work after \). Sometimes egrep is more liberal than the basic notation: (word)|(phrase) is fine with egrep, but basic regular expressions as POSIX defines them have no alternation operator at all. GNU tools accept \(word\)\|\(phrase\) as an extension, and the escaping notation sometimes confuses other tools.
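One way around the alternation problem that does not depend on any extension is to give grep each alternative in its own -e option. This sketch compares that with the extended syntax on three made-up lines; the claim that the bare pattern matches nothing in basic syntax is what I would expect from GNU grep, and other implementations may differ.

```shell
#!/bin/sh
# Hypothetical sample lines.
sample='a word
a phrase
neither'

# Extended syntax: bare parentheses and | are operators.
ere=$(printf '%s\n' "$sample" | grep -cE '(word)|(phrase)')

# Basic syntax, portable route: one -e option per alternative.
bre=$(printf '%s\n' "$sample" | grep -c -e 'word' -e 'phrase')

# Basic syntax, same bare pattern: ( ) and | are ordinary characters here,
# so grep looks for that literal text. grep exits nonzero on no match,
# hence the || true.
literal=$(printf '%s\n' "$sample" | grep -c '(word)|(phrase)' || true)

echo "$ere $bre $literal"
```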

Still, even if it does take several practice runs to get an expression just right, once you experience the power of regular expressions, you won't be able to live without them at the command prompt. Tasks that used to take ages will melt before you and become almost trivial. Regular expressions are a tool you won't want to be without.

Several sources exist for learning more about them:

1. Mastering Regular Expressions by Jeffrey Friedl from O'Reilly publishers is indispensable when it comes to regular expressions.

2. sed & awk by Dale Dougherty and Arnold Robbins, also from O'Reilly, contains a lot of great info on regular expressions.

3. Nearly any of the Perl books from O'Reilly will have some material on regular expressions. In fact, Perl is a great way to learn and use regular expressions, because it has specific and consistent operators for representing expressions. It's often easier to write a Perl script to do a job with regular expressions than it is to figure out why your egrep expression isn't working.

4. Any book on any of the UNIX shell interpreters will probably talk about regular expressions in some general level of detail.



David S. Jackson dsj@dsj.net