Regular Expressions
This article is an introduction to regular expressions, an extremely powerful tool available to any programmer.
Testing
- The easiest way to test regular expressions (PCRE style) is probably to use Perl directly, like this:
echo "input" | perl -n -e '/regexpHere/ and print'
- To print a match (and not the whole line), use:
echo "input" | perl -n -e '/(regexpHere)/ and print $1'
- You can also use grep but it does not support PCRE syntax natively (GNU grep has the -P switch which does).
Examples
- To find a <li> followed by a <ul> without a </li> element between (general note: you should very rarely use regular expressions to parse HTML):
/(?s)<li((?!<\/li).)*?<ul>/
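To see what this pattern accepts and rejects, here is a minimal Python sketch. The two sample strings are made up for illustration, and the pattern is written as a raw string without the / delimiters, so the slash needs no escaping.

```python
import re

# Same pattern as above, as a Python raw string.
LI_THEN_UL = re.compile(r"(?s)<li((?!</li).)*?<ul>")

# A <ul> nested inside an open <li>: should match.
nested = "<li>Fruits\n<ul><li>Apple</li></ul></li>"
# The <li> is closed before the <ul> starts: should not match.
flat = "<li>Fruits</li>\n<ul><li>Apple</li></ul>"

nested_match = LI_THEN_UL.search(nested)
flat_match = LI_THEN_UL.search(flat)

print(nested_match.group(0) if nested_match else None)
print(flat_match)
```

The negative lookahead is tested before every single character consumed, which is what stops the match from running past a closing </li>.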
Basics
Operators
- ^ matches at the start of the string being searched, and also just after each newline in multiline mode. It is optional. If a searched pattern starts with '^.*' and you only need to know whether it matches, that prefix is redundant: it is better to remove it altogether.
- $ matches at the end of the string, and also just before each newline in multiline mode. It is optional.
- *? is the non-greedy (lazy) version of *: it matches as little text as possible. +?, ?? and {m,n}? are also available.
- ? makes the preceding expression optional (zero or one occurrence). It is placed after the expression, e.g. ab? matches either a or ab.
- {n} asks for exactly n repetitions. Because of this, { and } can be treated as special characters and should be escaped when meant literally. However, if the regular expression engine does not find an integer inside the braces, it may treat them as literals.
- (?!...) is a negative lookahead: it succeeds only if the enclosed expression does not match at the current position, and it consumes no characters. It is generally hard to use correctly.
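The operators above can be tried out with a short Python sketch. All input strings and variable names here are invented for illustration; the behaviour shown is that of Python's re module, and other engines may differ in details.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy .* runs to the last </b>; non-greedy .*? stops at the first one.
greedy = re.search(r"<b>.*</b>", text).group(0)
lazy = re.search(r"<b>.*?</b>", text).group(0)

# ? makes the preceding expression optional: "ab?" matches "a" or "ab".
optional = re.findall(r"ab?", "a ab abb")

# {n} repeats exactly n times; literal braces are escaped with a backslash.
repeated = re.findall(r"\d{3}", "12 345 6789")
literal_braces = re.search(r"\{\d+\}", "id={42}").group(0)

# In multiline mode, ^ also matches at the start of each line.
line_starts = re.findall(r"^\w+", "first line\nsecond line", re.MULTILINE)

# Negative lookahead: "foo" only where it is not followed by "bar".
not_before_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

print(greedy, lazy, optional, repeated, literal_braces, line_starts, not_before_bar)
```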
Flags
- Normally the "." special character matches any character except newlines. You can activate matching of newline with the DOTALL flag. In a regexp this is activated via (?s).
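As a quick illustration of the DOTALL flag, the following Python sketch compares the default behaviour with the inline (?s) form and with the equivalent re.DOTALL constant. The sample text is an assumption made for the example.

```python
import re

text = "start\nend"

# Without DOTALL, "." refuses to cross the newline, so there is no match.
no_dotall = re.search(r"start.*end", text)

# DOTALL enabled from inside the pattern.
inline_flag = re.search(r"(?s)start.*end", text)

# The same flag passed through the API.
api_flag = re.search(r"start.*end", text, re.DOTALL)

print(no_dotall, inline_flag.group(0) == text, api_flag.group(0) == text)
```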
Groups
- The special group 0 corresponds to the entire pattern.
- In a regular expression, we often want to extract a particular piece of information from a string. To do so, we enclose the relevant sub-expression in parentheses. In Python, we can then refer to this group by its number or, if we add ?P<name> at the start of the group, by its name. To create a group which will not be available later for retrieval, write (?:expression).
- Example:
- regularExpression = re.compile(r"&price=(?P<Price>.*)&quantity=(?P<Quantity>.*)")
- matchObject = regularExpression.search(query_string)
- We could access the price value by matchObject.group(1) or matchObject.group('Price'). For the quantity, it would be group(2) or group('Quantity').
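A self-contained variant of this example might look like the sketch below. The sample query string is made up, and [^&]* is used instead of .* as a deliberate change so that each value stops at the next & separator; the non-capturing group at the end is an extra illustration, not part of the original example.

```python
import re

# Hypothetical input, for illustration only.
query_string = "item=42&price=9.99&quantity=3"

pattern = re.compile(r"&price=(?P<Price>[^&]*)&quantity=(?P<Quantity>[^&]*)")
match = pattern.search(query_string)

price_by_number = match.group(1)
price_by_name = match.group("Price")
quantity = match.group("Quantity")
whole_match = match.group(0)  # group 0 is the entire match

# (?:...) groups the alternatives without creating a numbered group,
# so the digits are still group 1.
non_capturing = re.search(r"(?:item|product)=(\d+)", query_string)

print(price_by_name, quantity, whole_match, non_capturing.groups())
```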
Quoting part of a regular expression
- This can be done with the \Q and \E constructs: everything between them is taken literally. They are supported by Perl, PCRE and Java, but not by JavaScript or by Python's re module.
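Python's re module has no \Q...\E, but re.escape() serves the same purpose by backslash-escaping the special characters of a string. A minimal sketch, with an invented literal and haystack:

```python
import re

literal = "1+1=2 (really?)"
haystack = "see: 1+1=2 (really?)"

# Used as-is, "+", "(", ")" and "?" are interpreted as operators.
unescaped_match = re.search(literal, haystack)

# re.escape() turns the string into a pattern that matches it literally.
escaped = re.escape(literal)
escaped_match = re.search(escaped, haystack)

print(unescaped_match, escaped_match.group(0))
```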
Regular expressions in Java
- Documentation for the Pattern and Matcher classes. Basically, you create a Pattern using a string corresponding to a regular expression; then you call matcher() on the Pattern object, giving as argument the input string. The resulting Matcher object can be used for the standard operations.
- If you use a string to create a regular expression, you need to double escape special characters, eg \\*. This is because the first backslash is needed to escape the backslash itself in Java.
- The matches() method of the Matcher class requires the whole input to match, like fullmatch() in Python; lookingAt() is similar to Python's match(), and find() is similar to search().
- Pattern.quote() is a static method that will produce a literal regular expression out of a string. This is useful for matching literally strings including characters such as "*" for example.
Manipulating results
- You call the group(groupNumber) method on a matcher to retrieve the value of a group match.
- There is no method for directly replacing the contents of a group submatch, but it is easy to write code such as:
value = value.substring(0, matcher.start(1)) + "#" + value.substring(matcher.end(1));
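The same splice can be sketched in Python, whose match objects expose start() and end() per group much like Java's Matcher. The input string and pattern below are assumptions chosen for the example.

```python
import re

value = "user=alice;token=secret123;"
match = re.search(r"token=([^;]*)", value)

if match:
    # Keep the text before and after group 1 and put "#" in its place.
    value = value[:match.start(1)] + "#" + value[match.end(1):]

print(value)
```

Only the span covered by group 1 is replaced; the rest of the match ("token=") is left untouched.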
Warnings
- Be careful: a matcher is created for one specific input string. Java strings are immutable, so assigning a new value to the original variable later does not change what the matcher works on, which can lead to subtle bugs. After changing the content, call reset(newInput) on the matcher or create a new matcher.
Regular expressions in Groovy
- Groovy has the following shortcuts:
- ==~ for matches().
- =~ for creating a matcher. The matcher is coerced to a Boolean via its find() method, so you can write:
if ("hello" =~ /hel/)
Be careful to include parentheses when negating the test:
if (! ("hello" =~ /hal/))
- A pattern can be directly created via ~/foo/.
Escaping
- If you use / / to create a regular expression, you need to escape the slashes. All escaping is done through a single backslash (rather than 2 backslashes when using a String). For instance:
/(url\(\"http:\/\/(.*?))\"/
- List of characters to escape:
- double quotes,
- slashes,
- parentheses.
- If you use a regular expression in String mode, and you wish to match a literal $, you must use three backslashes:
myString.replaceAll("\\\$\\{variable\\}", "something")
- It is NOT necessary to escape single quotes.
Regular expressions in JavaScript
- Be careful that older JavaScript engines have no DOTALL mode (the s flag was only added in ES2018). \n will match a newline, independently of the OS representation, on Firefox but NOT on Opera (at least; maybe also on IE). So if you need to match any character, the most portable option is to use:
[\s\S]
rather than . with DOTALL mode. Do not use (.|\n) and do not use (.|[\s\S])! The second may seem to work but will actually freeze the JS execution engine on complex matches.
- You can use the following statement to quote a literal string for a regular expression (same as using \Q and \E in Java):
regularExpressionLiteral = regularExpressionLiteral.replace(new RegExp("[.\\\\+*?\\[\\^\\]$(){}=!<>|:\\-]", "g"), "\\$&");
Regular expressions in Python
See the official documentation of the re module.
Basic Operations
You can perform two basic operations: search and match. Perl, by contrast, always uses search semantics.
- match(): Determine if the RE matches at the beginning of the string.
- search(): Scan through a string, looking for any location where this RE matches.
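The difference between the two operations is easiest to see side by side. The sketch below uses an invented digits-only pattern and also shows fullmatch(), which requires the whole string to match.

```python
import re

pattern = re.compile(r"\d+")

# match() only succeeds if the pattern matches at position 0.
at_start = pattern.match("123abc")
not_at_start = pattern.match("abc123")

# search() scans forward until it finds a match anywhere in the string.
anywhere = pattern.search("abc123")

# fullmatch() requires the entire string to match the pattern.
whole_string = pattern.fullmatch("123abc")

print(at_start.group(0), not_at_start, anywhere.group(0), whole_string)
```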
Regular expressions in PHP
Things worth keeping in mind:
- It is better to write your patterns in single-quoted strings, as PHP then won't try to expand variables starting with $. Note that in a double-quoted string, putting a single backslash (\) in front of the $ is NOT enough to match a literal dollar sign: PHP turns \$ into a plain $, which is itself a special character within regular expressions (end of string).
- Patterns must always be enclosed in delimiters. Typically you can just use "/" at the start and the end of the pattern; any character that is not alphanumeric, a backslash or whitespace can also serve as the delimiter.
Functions
- Use preg_match if you want to extract some data from a string via a regular expression. You can access the groups you created via $matches[1], $matches[2], etc. ($matches[0] holds the entire match).
preg_match($pattern, $data, $matches);
- Use preg_replace if you want to replace some data inside a string.