Regular Expressions

From Elvanör's Technical Wiki
Latest revision as of 20:49, 4 December 2011

This article is an introduction to an extremely powerful tool available to any programmer, Regular Expressions.

Testing

  • The easiest way to test regular expressions (PCRE style) is probably to use Perl directly like this.
echo "input" | perl -n -e '/regexpHere/ and print'
  • To print a match (and not the whole line), use:
echo "input" | perl -n -e '/(regexpHere)/ and print $1'
  • You can also use grep but it does not support PCRE syntax natively (GNU grep has the -P switch which does).

Examples

  • To find a <li> followed by a <ul> without a </li> element between (general note: you should very rarely use regular expressions to parse HTML):
 /(?s)<li((?!<\/li).)*?<ul>/
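As a rough check of the pattern above, here is a minimal Python sketch. The sample HTML strings and variable names are invented for illustration, and the \/ escape is dropped because Python patterns have no slash delimiters.

```python
import re

# Same pattern as above: (?s) lets "." cross newlines, and the
# "tempered dot" ((?!</li).) refuses to step over a closing </li>.
LI_THEN_UL = re.compile(r"(?s)<li((?!</li).)*?<ul>")

# Hypothetical inputs: a nested list, and a list that is closed before <ul>.
nested = "<li>Fruits\n<ul><li>Apple</li></ul></li>"
flat = "<li>Fruits</li>\n<ul><li>Apple</li></ul>"

nested_match = LI_THEN_UL.search(nested)
flat_match = LI_THEN_UL.search(flat)

print(nested_match.group(0) if nested_match else None)
print(flat_match)  # no <li> reaches a <ul> without crossing </li>
```

The lazy quantifier keeps the match as short as possible, while the lookahead is what actually prevents it from crossing a closing tag.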

Basics

Operators

  • ^ matches the start of the string, or the start of each line in multiline mode. It is optional. Note that a leading '^.*' is redundant when searching: it is better to drop it altogether.
  • $ matches the end of the string, or just before a newline in multiline mode. It is optional.
  • *? is the non-greedy (lazy) form of *: it matches as little text as possible. +?, ??, and {m,n}? are also available.
  • ? makes the preceding expression optional (zero or one occurrence). It goes after the expression, e.g. ab? matches either a or ab.
  • {n} asks for exactly n repetitions. Because of this, { and } can be treated as special characters and should be escaped when meant literally. Some engines treat the braces as literals when they do not enclose an integer, but do not rely on it.
  • (?!...) is a negative lookahead: it succeeds only if the text that follows does not match the enclosed expression, and it consumes no characters. Generally hard to use.
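The operators above can be tried out with a short Python sketch. The sample strings and variable names are made up for illustration; other engines may differ in details.

```python
import re

text = "<a><b>"
greedy = re.search(r"<.*>", text).group(0)    # takes as much as possible
lazy = re.search(r"<.*?>", text).group(0)     # takes as little as possible

# ab? matches "a" or "ab", but not "abb" as a whole string.
optional = [bool(re.fullmatch(r"ab?", s)) for s in ("a", "ab", "abb")]

# {n} is a repetition count; escaped braces are literal characters.
three_digits = bool(re.fullmatch(r"\d{3}", "123"))
literal_braces = re.search(r"\{x\}", "a{x}b").group(0)

# In multiline mode, ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# Negative lookahead: "foo" only when it is not followed by "bar".
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

print(greedy, lazy, optional, three_digits, literal_braces, line_starts, not_followed)
```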

Flags

  • Normally the "." special character matches any character except newlines. You can activate matching of newline with the DOTALL flag. In a regexp this is activated via (?s).
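A small Python sketch of the DOTALL behaviour described above; the input string is invented for illustration. In Python the flag can be given either inline as (?s) or as re.DOTALL.

```python
import re

text = "a\nb"

default = re.search(r"a.b", text)            # "." stops at the newline
inline = re.search(r"(?s)a.b", text)         # inline DOTALL flag
flag = re.search(r"a.b", text, re.DOTALL)    # same thing, passed as an argument

print(default, inline is not None, flag is not None)
```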

Groups

  • The special group 0 corresponds to the entire pattern.
  • In a regular expression, we often want to extract a particular piece of information from a string. To do so, enclose the relevant "sub expression" in parentheses. In Python, we can then refer to this group by its number, or, if we add ?P<name> to the group, by its name. To create a group which will not be available later for retrieval, write (?:expression).
  • Example:
    • regularExpression = re.compile(r"&price=(?P<Price>.*)&quantity=(?P<Quantity>.*)")
    • matchObject = regularExpression.search(query_string)
    • We could access the price value by matchObject.group(1) or matchObject.group('Price'). For the quantity, it would be group(2) or group('Quantity').
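The example above can be run end to end as follows. This is only a sketch: the query string is a made-up value, and the last lines add a non-capturing group to show that it does not get a number.

```python
import re

# Hypothetical input, chosen so that both groups have something to capture.
query_string = "id=7&price=12.50&quantity=3"

regular_expression = re.compile(r"&price=(?P<Price>.*)&quantity=(?P<Quantity>.*)")
match_object = regular_expression.search(query_string)

whole = match_object.group(0)             # group 0 is the entire match
price = match_object.group("Price")       # same as group(1)
quantity = match_object.group(2)          # same as group("Quantity")

# (?:...) groups without capturing, so the digits become group 1.
non_capturing = re.search(r"(?:&price=)(\d+)", query_string)

print(whole, price, quantity, non_capturing.groups())
```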

Quoting part of a regular expression

  • This can be done via the \Q and \E constructs (you put the literal between them). They are supported by Java, Perl and PCRE, but not by JavaScript or by Python's re module.
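In Python, the closest equivalent is the standard re.escape() function, which quotes a whole string rather than a part of a pattern. A minimal sketch, with an invented literal and haystack:

```python
import re

# A literal full of characters that are special in a regular expression.
literal = "1+1=2 (approx.)*"
haystack = "note: 1+1=2 (approx.)* end"

# Unescaped, the literal is read as a pattern and does not find itself.
naive = re.search(literal, haystack)

# Escaped, every special character is matched literally.
quoted = re.search(re.escape(literal), haystack)

print(naive, quoted.group(0))
```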

Regular expressions in Java

  • Documentation for the Pattern (http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html) and Matcher (http://java.sun.com/javase/6/docs/api/java/util/regex/Matcher.html) classes. Basically, you create a Pattern from a string containing a regular expression; then you call matcher() on the Pattern object, passing the input string as argument. The resulting Matcher object can be used for the standard operations.
  • If you use a string to create a regular expression, you need to double the backslash in front of special characters, e.g. \\*. This is because the first backslash is needed to escape the backslash itself in a Java string literal.
  • The matches() method of the Matcher class requires the whole input to match, like fullmatch() in Python; lookingAt() only anchors the match at the start, like match(); find() is similar to search().
  • Pattern.quote() is a static method that will produce a literal regular expression out of a string. This is useful for matching literally strings including characters such as "*" for example.

Manipulating results

  • You call the group(groupNumber) method on a matcher to retrieve the value of a group match.
  • There is no method for directly replacing the contents of a group submatch, but it is easy to write code such as:
value = value.substring(0, matcher.start(1)) + "#" + value.substring(matcher.end(1));
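The same slicing idea, transposed to Python for illustration (Python match objects also expose start() and end() per group). The input string and pattern are invented; this is a sketch, not a translation of any particular Java code.

```python
import re

value = "width=100px"
match = re.search(r"width=(\d+)px", value)

# Keep everything before and after group 1, and put "#" in its place.
replaced = value[:match.start(1)] + "#" + value[match.end(1):]

# The match object still refers to the original string, not the new one.
original_seen_by_match = match.string

print(replaced, original_seen_by_match)
```

As in Java, the match still refers to the original input after the new string has been built.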

Warnings

  • Be careful: a matcher is bound to the input string it was created with. Java strings are immutable, so assigning a new value to the original variable later (as in the substring example above) does not change what the matcher sees; this can lead to subtle bugs. Call reset() with the new input, or create a new matcher.

Regular expressions in Groovy

  • Documentation: http://groovy.codehaus.org/Regular+Expressions
  • Groovy has the following shortcuts:
    • ==~ for matches().
    • =~ for creating a matcher. The matcher is coerced to a Boolean via its find() method, so you can write:
if ("hello" =~ /hel/)

Be careful to include parentheses in the following case, because ! binds more tightly than =~:

if (! ("hello" =~ /hal/))
  • A pattern can be directly created via ~/foo/.

Escaping

  • If you use / / to create a regular expression, you need to escape the forward slashes. All escaping is done with a single backslash (rather than two backslashes when using a String). For instance:
/(url\(\"http:\/\/(.*?))\"/
  • List of characters to escape:
    • double quotes,
    • slashes,
    • parentheses.
  • If you use a regular expression in String mode, and you wish to match a literal $, you must use three backslashes:
myString.replaceAll("\\\$\\{variable\\}", "something")
  • It is NOT necessary to escape single quotes.

Regular expressions in JavaScript

  • Be careful that older JavaScript engines have no DOTALL mode (the s flag was only added in ECMAScript 2018). \n will match a newline, independently of the OS representation, on Firefox but NOT on Opera (at least; maybe also on IE). So if you need to match any character, the best option is to use:
[\s\S]

rather than . with DOTALL mode. Do not use (.|\n) and do not use (.|[\s\S])! The second may seem to work but will actually freeze the JS execution engine on complex matches.

  • You can use the following statement to quote a literal string for a regular expression (same as using \Q and \E in Java):
regularExpressionLiteral = regularExpressionLiteral.replace(new RegExp("[.\\\\+*?\\[\\^\\]$(){}=!<>|:\\-]", "g"), "\\$&");

Regular expressions in Python

Official documentation: http://docs.python.org/lib/module-re.html

Basic Operations

You can perform two basic operations: search and match. In Perl, search is always used.

  • match(): Determine if the RE matches at the beginning of the string.
  • search(): Scan through a string, looking for any location where this RE matches.
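The difference between the two operations shows up as soon as the match is not at the beginning of the string. A minimal sketch, with an invented input; fullmatch() is included for comparison and requires the whole string to match.

```python
import re

pattern = re.compile(r"\d+")
text = "order 42"

at_start = pattern.match(text)      # the string does not start with a digit
anywhere = pattern.search(text)     # scans forward until it finds "42"
whole = pattern.fullmatch("42")     # the entire string must match

print(at_start, anywhere.group(0), whole.group(0))
```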

Regular Expressions in PHP

Things worth keeping in mind:

  • It is better to write your patterns in single-quoted strings, since PHP does not expand variables starting with $ inside them. Note that in a double-quoted string, putting a single backslash (\) in front of the $ stops the variable expansion but does not give you a literal dollar sign in the pattern: the regular expression engine then receives a bare $, which is itself a special character (end of string).
  • Patterns passed to the preg_* functions must always be enclosed in delimiters; typically you can just use "/" at the start and the end of the pattern.

Functions

  • Use preg_match if you want to get some data inside a string via a regular expression. You can access the groups you created via $matches[1], $matches[2], etc.; $matches[0] holds the whole match.
preg_match($pattern, $data, $matches);
  • Use preg_replace if you want to replace some data inside a string.