You could call this another in my series on the UNIX command line interface. ``Regular Expressions'' are the way to identify specific portions of text using symbolic patterns with only the ASCII character set. Identifying these portions of text enables you to perform either manual or automated editing functions easily on text files, such as UNIX configuration, data, or any other text file.
One of the most frequently used commands that uses regular expressions is ``grep,'' the ubiquitous UNIX search utility. GNU grep also answers to egrep and fgrep, which are the same as grep -E and grep -F, respectively. (See man grep for more information.) Other tools that use regular expressions include sed, vi, Emacs, gawk, Perl, Bash, Tcl, and many others.
The basic types of expressions provided for are alternation, concatenation, grouping, closure, and some advanced expressions. Then there are umpteen ways to use each of these, with their added syntax features as well. Finally, there are some oddities and idiosyncrasies worth noting.
Alternation. When you want to isolate either one
pattern or another pattern (or even more patterns), you use an
alternation expression. The alternation symbol for UNIX regular
expressions is |.
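Example: grep -l '#!/bin/sh\|#!/usr/bin/perl' *
will list every file in the current directory that contains either
interpreter line, that is, the Bourne shell scripts and Perl scripts.
(The lowercase -l lists files that match; the uppercase -L lists
files that do not.)
Or cat /var/log/ppp.log | grep -e 'fail\|failed' will
find all instances of a failed connection in your ppp log. (Since
``fail'' already matches inside ``failed,'' the second alternative is
redundant here; it only shows the form.) The escaped \| is a GNU grep
extension to basic syntax; grep -E takes a plain | instead.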
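As a small sketch of alternation, the following runs grep -E over a few invented log lines (the sample text is an assumption for illustration, not real ppp output). With -E the bar is written without a backslash.

```shell
# Sample log lines (hypothetical, for illustration only).
log='connect ok
auth failed
link timeout
dial fail'

# With -E (extended syntax) the alternation bar is written unescaped.
# A line is printed if it contains either alternative.
matches=$(printf '%s\n' "$log" | grep -E 'fail|timeout')
printf '%s\n' "$matches"
```

Only the ``connect ok'' line is left out, since it contains neither alternative.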
Concatenation. To match a simple text string and
nothing else is a conventional expression, but to match that same
string followed by another pattern is a concatenation. Example:
locate log | grep -E 'ppp\.log$' will match
/var/log/ppp.log,
but locate log | grep -E 'ppp\.log.*' will match
ppp.log.0 and ppp.log.1 also,
as well as any gzipped logfiles that have been rotated, such
as ppp.log.3.gz. This latter expression is an example
of concatenation: the literal ppp\.log followed by .*, which stands
for any run of characters.
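Here is a hedged sketch of the same idea on a made-up list of file names, standing in for whatever locate would print on your system. The second pattern concatenates the literal name with an optional rotation suffix; the exact suffix form (a number, then an optional .gz) is an assumption about how the logs are rotated.

```shell
# Hypothetical file names, standing in for the output of locate.
names='/var/log/ppp.log
/var/log/ppp.log.0
/var/log/ppp.log.3.gz
/var/log/pppXlog'

# Anchored with $: only names that end in the literal text ppp.log.
exact=$(printf '%s\n' "$names" | grep -E 'ppp\.log$')

# The literal name concatenated with an optional ".N" and optional ".gz".
rotated=$(printf '%s\n' "$names" | grep -E 'ppp\.log(\.[0-9]+(\.gz)?)?$')

printf 'exact:\n%s\nrotated:\n%s\n' "$exact" "$rotated"
```

Because the dot is escaped, pppXlog is rejected by both patterns; an unescaped dot would have accepted it.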
Grouping. Often it's convenient to express a group
of pattern matches so you can reuse the group somehow. Groups are
designated with parentheses in regular expressions. Example:
:%s/Comedy of \(Errors\)/Tragic \1/g This Vi editor command
(the Vim editor, actually) searches through a document for ``Comedy
of Errors'' and replaces all instances of it with ``Tragic Errors''
instead. Grouping done in this way can be very useful.
Another more practical usage might be:
cat /etc/group | grep '[a-z]$' | sed 's/\(.*\):\(.*\):\(.*\):\(.*\)/\1-->\4/' | sort | more
This is called ``backreferencing.'' Each of the
parenthetical expressions is a group, and each group is remembered by
sed and can be called by using its positional variable. The first
group is called by \1, the second by \2, and so on.
This command looks through the /etc/group file and finds
only those lines with members in the group (lines that end in a
lowercase letter rather than a colon), then pipes those lines to sed,
which displays the first and fourth groups, separated by an arrow.
The ``\1'' and ``\4'' are backreferences to the first and fourth
parenthetical groups.
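To try the backreference pipeline without touching the real /etc/group, here is a minimal sketch that feeds it three invented records in the same name:password:gid:members layout (the group names and members are assumptions).

```shell
# Made-up /etc/group style records: name:password:gid:members
groups_sample='wheel:x:10:root,alice
audio:x:63:
staff:x:50:bob'

# Keep lines that end in a letter (a non-empty member list), then print
# the first and fourth groups joined by an arrow, using \1 and \4.
members=$(printf '%s\n' "$groups_sample" |
  grep '[a-z]$' |
  sed 's/\(.*\):\(.*\):\(.*\):\(.*\)/\1-->\4/' |
  sort)
printf '%s\n' "$members"
```

The audio line ends with a colon, so grep drops it before sed ever sees it.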
Closure. We've already seen the principle of
closure in action above. Closure symbols are the following
operators: ? + and * These metacharacters describe the occurrence
of the character (or group) immediately before them in an expression.
The question mark means ``zero or once,'' the plus sign means ``one
or more,'' and the asterisk means ``zero or many times.'' (With grep,
use -E to get ? and + as operators.) So,
ab+ means ``a'' with one or more ``b's'' after it.
(ab)+ means one or more instances of ab together.
abcd?e would match abce as well as abcde, but not
abcd9e; it is in the shell's filename wildcards, not in regular
expressions, that ? stands for any single character. Likewise abcd*
matches abc followed by any number of d's (abc, abcd, abcdd, ...);
to match abcd followed by anything, write abcd.* instead.
The expression completes by closure using
these metacharacters.
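The three closure operators can be compared side by side. This sketch uses an invented word list and grep's -x option, which requires the whole line to match, so that each operator's reach is visible.

```shell
# Invented test words, one per line.
words='abce
abcde
abcd9e
abc
abcd
abcdd'

# -x forces a whole-line match; -E enables ? and + as operators.
optional=$(printf '%s\n' "$words" | grep -E -x 'abcd?e')  # d zero or one time
star=$(printf '%s\n' "$words" | grep -E -x 'abcd*')       # d zero or more times
plus=$(printf '%s\n' "$words" | grep -E -x 'abcd+')       # d one or more times

printf 'optional: %s\n' $optional
printf 'star: %s\n' $star
printf 'plus: %s\n' $plus
```

Note that abcd9e matches none of the three patterns: each operator applies only to the d in front of it.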
Advanced Expressions. Most metacharacters mentioned above
do not retain their special meanings inside the square brackets [
and ]. But position matters for a few characters: a ^ placed first
negates the set, a - between two characters forms a range, and a ]
closes the set unless it comes first.
So, if you wanted to search for a bunch of metacharacters, you couldn't
use: [^&*$@], because the leading ^ turns it into ``any character
except these,'' but you could use [@$^&*]. You just have to pay close attention to how the regular expression
gets interpreted, namely, one character at a time. It's easy to be
ambiguous if you're not careful.
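A short sketch makes the effect of the leading ^ concrete. The input is an invented list of single characters, one per line; the first pattern counts the lines containing any of the listed characters, and the second prints the lines that consist of one character outside the negated set.

```shell
# One character per line (invented test input).
chars='a
@
$
^
&
*'

# ^ is not first, so it is an ordinary member of the set.
listed=$(printf '%s\n' "$chars" | grep -c '[@$^&*]')

# Inside the brackets ^ is first, so the set is negated:
# any single character except & * $ @.
negated=$(printf '%s\n' "$chars" | grep '^[^&*$@]$')

printf 'listed: %s\n' "$listed"
printf 'negated:\n%s\n' "$negated"
```

The caret itself shows up in the second result, because in the negated set it acts as the operator and is no longer a member.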
Another advanced expression would be counting expressions, whether
individual characters or groups of expressions. We use the curly
braces for these: { and }. We can specify how many
instances of a pattern we wish to isolate. The previous sed
example could be rewritten with a count as
sed -e 's/^\([^:]*\):\([^:]*:\)\{2\}\(.*\)$/\1-->\3/', where \{2\}
repeats the ``field plus colon'' group twice to skip the password
and group ID fields. Notice
that the curly braces are escaped here.
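Another example: perhaps you discovered a peculiar mistake that had
been duplicated throughout some documents, such that only a computer
could create or fix it. You might type grep -e '^M\{4\}' to
find four consecutive carriage returns. Here ^M stands for a single
literal carriage-return character (typed as Ctrl-V Ctrl-M in most
shells), not a caret followed by M; typed as two ordinary characters,
the pattern would instead match lines beginning with MMMM. (Line
returns appear differently in various environments, so you might
have to use a different search key for yours.) Notice that each
curly brace is ``escaped'': basic regular expressions require \{ and
\} for counts, while the single quotes keep the shell from
interpreting the pattern prematurely.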
But perhaps several consecutive line returns are correct in your
documentation style guide under certain circumstances. You can
further refine your search by adding more counting expressions (also
called ranges or intervals). For example:
grep -e '^M\{3,5\}' will match a run of at least three and at most
five consecutive line returns. So using a pair of values
inside the curly braces, you can specify a range of instances
for your expression to match. (A longer run still contains a shorter
one, so the upper limit only takes effect when the surrounding
pattern pins down both ends of the run.)
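The behavior of a range is easier to see with a visible character than with carriage returns, so this sketch uses runs of the letter x as a stand-in (an assumption made purely for readability). It counts matching lines with and without anchors.

```shell
# Runs of 2, 3, 5 and 6 x's (stand-ins for the carriage returns).
lines='xx
xxx
xxxxx
xxxxxx'

# Without anchors, a run of six still contains a run of three to five.
loose=$(printf '%s\n' "$lines" | grep -c 'x\{3,5\}')

# Anchoring both ends enforces the upper bound as well.
strict=$(printf '%s\n' "$lines" | grep -c '^x\{3,5\}$')

printf 'loose: %s, strict: %s\n' "$loose" "$strict"
```

The unanchored pattern accepts the six-character line, while the anchored one rejects it.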
Of course, you can use multiple expression combinations, too. For
example, if your style guide specified acronyms must be defined on
the first instance, this expression could find those that were not:
grep -e '\<[A-Z]\{2,\}\> [^(]'. It matches a word of two or more
capital letters followed by a space and anything other than an
opening parenthesis (\< and \> mark word boundaries in GNU grep).
You can get much more
sophisticated than this in your document checks, but you'll probably
want to collect your commands into a script.
Oddities. You should be aware that in the UNIX world, not all tools obey the same rules when it comes to regular expressions. There are simply different rules as to what is ``regular'' between tools. In particular, grouping syntax and when to escape certain metacharacters are not consistent between all regular expression tools.
Egrep, for example, is implemented differently on several different
platforms. Some versions accept the escaped operators of sed
and vi, but others do not. Also, in some tools
the asterisk closure operator will not work after \).
Sometimes egrep is more liberal than basic regular expressions:
(word)|(phrase) is fine with egrep, but in basic syntax it must be
written \(word\)\|\(phrase\), and not every tool accepts the
escaped bar.
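The difference between the two syntaxes can be checked directly. This sketch gives the same pattern text to grep -E and to plain grep over three invented lines and counts the matching lines in each case.

```shell
# Invented sample text.
text='a word here
a phrase there
neither'

# Extended syntax: bare parentheses and the bar are operators.
ere=$(printf '%s\n' "$text" | grep -E -c '(word)|(phrase)')

# Basic syntax: the same characters are literal unless escaped, so this
# looks for the literal text "(word)|(phrase)" and finds nothing.
# grep exits non-zero when nothing matches, hence the "|| true".
bre=$(printf '%s\n' "$text" | grep -c '(word)|(phrase)' || true)

printf 'extended: %s, basic: %s\n' "$ere" "$bre"
```

When moving a pattern between tools, a quick count like this is a cheap way to confirm which syntax a given tool is using.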
Still, even if it does take several practice runs to get an expression just right, once you experience the power of regular expressions, you won't want to be without them at the command prompt. Tasks that used to take ages to struggle through will melt before you and become almost trivial.
Several sources exist for learning more about them: