Regular Expressions

From Elvanör's Technical Wiki
Revision as of 20:49, 4 December 2011 by Elvanor (talk | contribs) (→‎Regular expressions in JavaScript)

This article is an introduction to regular expressions, an extremely powerful tool available to any programmer.

Testing

  • The easiest way to test regular expressions (PCRE style) is probably to use Perl directly, like this:
echo "input" | perl -n -e '/regexpHere/ and print'
  • To print a match (and not the whole line), use:
echo "input" | perl -n -e '/(regexpHere)/ and print $1'
  • You can also use grep, but it does not support PCRE syntax by default (GNU grep supports it through the -P switch).
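
For comparison, the two Perl one-liners above can be sketched in Python. The helper names matching_lines and first_group, as well as the sample input, are made up for this illustration; note also that the syntax of Python's re module is close to PCRE but not identical.

```python
import re

def matching_lines(pattern, text):
    """Return the lines in which pattern matches somewhere (like '/re/ and print')."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]

def first_group(pattern, text):
    """Return group 1 of the first match on each matching line (like 'print $1')."""
    regex = re.compile(pattern)
    results = []
    for line in text.splitlines():
        match = regex.search(line)
        if match:
            results.append(match.group(1))
    return results

# Hypothetical sample input.
sample = "price=12\nno number here\nprice=7"
print(matching_lines(r"\d+", sample))
print(first_group(r"price=(\d+)", sample))
```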

Examples

  • To find a <li> followed by a <ul> without a </li> element between (general note: you should very rarely use regular expressions to parse HTML):
 /(?s)<li((?!<\/li).)*?<ul>/
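
A quick way to check this pattern is to run it against a nested list and a flat one. The sketch below uses Python and two invented HTML snippets; the only change to the pattern is that the "/" no longer needs a backslash, since Python has no / / delimiters.

```python
import re

# Same pattern as above; (?s) lets "." cross newlines.
NESTED_LIST = re.compile(r"(?s)<li((?!</li).)*?<ul>")

# Invented test inputs: the first has a <ul> inside an open <li>,
# the second closes every <li> before the next <ul>.
nested = "<ul><li>Fruit\n<ul><li>Apple</li></ul></li></ul>"
flat = "<ul><li>Fruit</li><li>Apple</li></ul><ul></ul>"

print(NESTED_LIST.search(nested) is not None)
print(NESTED_LIST.search(flat) is not None)
```

The negative lookahead is tested before every character consumed, which is what stops the match from running past a closing </li>.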

Basics

Operators

  • ^ matches the start of the string to search, or the start of each line in multiline mode. It is optional. Note that when searching, a pattern that begins with '^.*' usually does not need that prefix: it is better to remove it altogether.
  • $ matches the end of the string, or just before each newline in multiline mode. It is optional.
  • *? is the non-greedy version of *: it matches as little text as possible. +?, ?? and {m,n}? are also available.
  • ? makes the preceding expression optional (zero or one occurrence). It is put after the expression, e.g. ab? matches either a or ab.
  • {n} asks for exactly n repetitions. Because of this, the { and } symbols can be considered special and should be escaped when meant literally. However, if the regular expression engine does not find an integer inside the braces, it may treat them as literals.
  • (?!...) is a negative lookahead: it succeeds only if the text that follows does not match the enclosed expression, and it consumes no characters. Generally hard to use.
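
The operators above can be tried out with a few short Python calls. This is only a sketch; the sample strings are invented for the illustration.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b>; non-greedy .*? stops at the first one.
greedy = re.search(r"<b>.*</b>", text).group(0)
lazy = re.search(r"<b>.*?</b>", text).group(0)

# ab? matches "a" or "ab"; the ? applies to the "b" only.
optional = re.findall(r"ab?", "a ab abb")

# {3} asks for exactly three repetitions of the preceding expression.
three_digits = re.findall(r"\b\d{3}\b", "7 42 123 4567")

# Negative lookahead: "foo" only when it is not followed by "bar".
not_foobar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

print(greedy)
print(lazy)
print(optional, three_digits, not_foobar)
```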

Flags

  • Normally the "." special character matches any character except newlines. You can make it match newlines too with the DOTALL flag. Inside a regexp, this flag is enabled with (?s).
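
As a small illustration in Python, the flag can be passed either as an argument (re.DOTALL) or inline with (?s). The sample text is invented.

```python
import re

text = "first line\nsecond line"

# By default "." stops at the newline, so the pattern cannot span both lines.
default = re.search(r"first.*second", text)

# The DOTALL flag can be passed as an argument...
with_flag = re.search(r"first.*second", text, re.DOTALL)

# ...or switched on inside the pattern with (?s).
inline = re.search(r"(?s)first.*second", text)

print(default)
print(with_flag.group(0) == inline.group(0))
```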

Groups

  • The special group 0 corresponds to the entire pattern.
  • In a regular expression, we often want to extract a particular piece of information from a string. To do so, enclose the relevant subexpression in parentheses. In Python, we can then refer to this group by its number, or, if we add ?P<name> right after the opening parenthesis, by its name. To create a group which will not be available later for retrieval, write (?:expression).
  • Example (requires import re):
    • regularExpression = re.compile(r"&price=(?P<Price>.*)&quantity=(?P<Quantity>.*)")
    • matchObject = regularExpression.search(query_string)
    • We could access the price value by matchObject.group(1) or matchObject.group('Price'). For the quantity, it would be group(2) or group('Quantity').
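
A self-contained variant of this example might look as follows. The query string is invented, and two things differ from the pattern above: [^&]* replaces .* so that each value stops at the next "&", and a non-capturing group is added to show that it does not shift the group numbers.

```python
import re

# Hypothetical query string, invented for this example.
query_string = "item=42&price=19.99&quantity=3"

# (?:item|id) groups the alternatives without creating a numbered group,
# so Price is still group 1 and Quantity group 2.
regular_expression = re.compile(
    r"(?:item|id)=\d+&price=(?P<Price>[^&]*)&quantity=(?P<Quantity>[^&]*)"
)
match_object = regular_expression.search(query_string)

print(match_object.group(0))  # group 0 is the entire match
print(match_object.group(1), match_object.group("Price"))
print(match_object.group(2), match_object.group("Quantity"))
```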

Quoting part of a regular expression

  • This can be done via the \Q and \E constructs: everything placed between them is taken literally. They are supported by Perl, PCRE and Java, but not by JavaScript or Python.
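
Python's re module has no \Q and \E; the closest counterpart is re.escape, which backslash-escapes the special characters of a string. A minimal sketch, with an invented literal:

```python
import re

literal = "1+1=2 (really?)"
haystack = "proof: 1+1=2 (really?)"

# re.escape backslash-escapes the regex metacharacters in the string.
pattern = re.compile(re.escape(literal))
print(pattern.search(haystack) is not None)

# Unescaped, "+" and "?" are quantifiers and the parentheses form a group,
# so the same text is not matched literally.
print(re.search(literal, haystack) is not None)
```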

Regular expressions in Java

  • Documentation for the Pattern and Matcher classes. Basically, you create a Pattern using a string corresponding to a regular expression; then you call matcher() on the Pattern object, giving as argument the input string. The resulting Matcher object can be used for the standard operations.
  • If you use a string to create a regular expression, you need to double the backslash in front of special characters, e.g. \\*. This is because the first backslash escapes the second one in the Java string literal.
  • The matches() method of the Matcher class requires the whole input to match (like fullmatch() in Python); lookingAt() is similar to match() in Python, and find() is similar to search().
  • Pattern.quote() is a static method that produces a literal regular expression out of a string. This is useful for literally matching strings that contain characters such as "*".

Manipulating results

  • You call the group(groupNumber) method on a matcher to retrieve the value of a group match.
  • There is no method for directly replacing the contents of a group submatch, but it is easy to write code such as:
value = value.substring(0, matcher.start(1)) + "#" + value.substring(matcher.end(1));
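
The same slicing idea carries over to Python, where match objects also expose start(group) and end(group). The input string and pattern below are invented for the illustration.

```python
import re

# Hypothetical input; only the content of group 1 should be replaced.
value = "user=alice;token=secret123;lang=en"
match = re.search(r"token=(\w+)", value)

if match:
    # start(1) and end(1) give the span of group 1 inside value.
    masked = value[:match.start(1)] + "#" + value[match.end(1):]
else:
    masked = value

print(masked)
```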

Warnings

  • Be careful: a matcher is created for one specific input string and keeps working on it. Since Java strings are immutable, assigning a new value to the original variable later does not change what the matcher sees, which can lead to subtle bugs. Call reset(newInput) on the matcher to point it at new content.

Regular expressions in Groovy

  • Groovy has the following shortcuts:
    • ==~ for matches().
    • =~ for creating a matcher. The matcher is coerced to a Boolean via its find() method, so you can write:
if ("hello" =~ /hel/)

Be careful to include parentheses when negating the test:

if (! ("hello" =~ /hal/))
  • A pattern can be directly created via ~/foo/.

Escaping

  • If you use / / to create a regular expression, you need to escape the slashes. All escaping is done through a single backslash (rather than 2 backslashes when using a String). For instance:
/(url\(\"http:\/\/(.*?))\"/
  • List of characters to escape:
    • double quotes,
    • slashes,
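    • parentheses.
  • If you use a regular expression in String mode, and you wish to match a literal $, you must use three backslashes (two produce the backslash that reaches the regular expression, the third stops Groovy from interpolating the $):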
myString.replaceAll("\\\$\\{variable\\}", "something")
  • It is NOT necessary to escape single quotes.

Regular expressions in JavaScript

  • Be careful: JavaScript traditionally has no DOTALL mode (the s flag was only added in ES2018). \n will match a newline, independently of the OS representation, on Firefox but NOT on Opera (at least; maybe also on IE). So if you need to match any character, the best option is to use:
[\s\S]

rather than . with DOTALL mode. Do not use (.|\n) and do not use (.|[\s\S])! The second may seem to work but will actually freeze the JS execution engine on complex matches.

  • You can use the following statement to quote a literal string for a regular expression (same as using \Q and \E in Java):
regularExpressionLiteral = regularExpressionLiteral.replace(new RegExp("[.\\\\+*?\\[\\^\\]$(){}=!<>|:\\-]", "g"), "\\$&");

Regular expressions in Python

The official documentation is the reference for the re module.

Basic Operations

You can perform two basic operations: search and match. In Perl, search is always used.

  • match(): Determine if the RE matches at the beginning of the string.
  • search(): Scan through a string, looking for any location where this RE matches.
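
The difference between the two operations shows up as soon as the match is not at the start of the string. A minimal sketch with invented inputs:

```python
import re

pattern = re.compile(r"\d+")
text = "order 66"

# match() only tries at the beginning: the string starts with "o", so no match.
at_start = pattern.match(text)

# search() scans the whole string and finds the digits further on.
anywhere = pattern.search(text)

# match() succeeds when the digits are at the very start.
from_digit = pattern.match("66 orders")

print(at_start)
print(anywhere.group(0), from_digit.group(0))
```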

Regular expressions in PHP

Things worth keeping in mind:

  • It is better to write your patterns in single-quoted strings, since PHP does not expand variables starting with $ in them. Note that in a double-quoted string, putting a single backslash (\) in front of the $ to prevent variable expansion is NOT enough to match a literal $: PHP then passes a bare $ to the regular expression engine, where it is already a special character.
  • Patterns must be enclosed in delimiters: typically you can just use "/" at the start and the end of the pattern.

Functions

  • Use preg_match if you want to extract data from a string via a regular expression. The groups you created are then available as $matches[1], $matches[2], etc. ($matches[0] holds the whole match).
preg_match($pattern, $data, $matches);
  • Use preg_replace if you want to replace some data inside a string.