2010-06-06

Regular Expressions

A regular expression (or regex, regexp, r.e.) is a pattern that matches a string or a portion of a string. They are used for validation, search, and find/replace. You'll find them in Java, JavaScript, Ruby, Python, Perl, and dozens of other popular languages. All languages agree on a common notation core, e.g. square brackets for character classes, ? for zero-or-one, * for zero or more, + for one-or-more, parentheses for grouping and capturing, and so on.

There's room for some notational innovations here, though. We'll cover a little of what's already in use and sketch some new ideas, too.

Examples from the Common Notation

Virtually all languages will interpret these the same way:

hello
gray|grey
gr(a|e)y
gr[ae]y
colou?r
go*gle
go+gle
g(oo)+gle
z{3}
z{3,6}
z{3,}
[Bb]rainf\*\*k
\d{5}(-\d{4})?
1\d{10}
[2-9]|[12]\d|3[0-6]
Hello\nworld
b..b
\d+(\.\d\d)?
^dog
dog$
^dog$
sh[^i]t
\d+(\.\d+([Ee][+-]?\d+)?)?
https?://(?:www\.)citysearch\.com/profile/[^/]+/(\d+)/?

Metacharacters and Escapes

Most characters stand for themsleves in a pattern. The ones that don't are called metacharacters. Most languages have 14 metacharacters (some more, some less). The usual ones are:

    |    ?    *    +    .    ^    $
    (    )    [    ]    {    }    \

If you want to use a metacharacter as a regular character, you have to escape it, e.g. \+, \[, \\, etc. The backslash also introduces dozens of other simplified expressions:

...
... TODO ...
...

Defining Patterns

How do you write a regex, such as \d+(\.\d+)? in program text? How about in a string literal?

'\d+(\.\d+)?'

Maybe.... The thing is most languages use \ to escape within a string literal. Perl and Ruby don't make the backslash special if the string literal has single quotes, but they do for double quotes, like most languages. In that case, you have to write:

"\\d+(\\.\\d+)?"

And if you need to write a regex that has to match a backslash character, the regex is \\, which looks like this in a string literal:

"\\\\"

But a regex is not a string: it has to be compiled into a little program that matches:

var r = re.compile("\\d+(\\.\\d+)?")
r = "\\d+(\\.\\d+)?".toregex
Pattern p = Pattern.compile("\\d+(\\.\\d+)?");

But since regexes are so important, and that backslash issue gets so annoying, we'd like a language to have a special syntax for describing regexes—something that automatically compiles once, and frees us from doubly escaping backslashes. Common notations include Ruby's %r
and slash delimiters:

%r{\d+(\.\d+)?}
/\d+(\.\d+)?/

The problem with the slash approach is that you have to escape any / characters you want in the regex itself. The %r approach has so such trouble; the language's parser can figure out which is the terminating } easily.

Matchers

To use a regex you match it against some text. This allows you to

  • Validate that the text conforms to a desired pattern
  • Find (and extract) the parts of the text that do conform
  • Replace matching portions of the text with something else

So these matches are stateful ... TODO ... Once you have a (compiled, immutable) pattern, you use it by creating a (stateful) matcher.

Character Classes

Quantifiers

Groups and Capturing

Backreferences

Boundaries

Lookaround

Modifiers

No comments:

Post a Comment