Regular Expressions: Difference between revisions

From Elvanör's Technical Wiki
Latest revision as of 20:49, 4 December 2011

This article is an introduction to an extremely powerful tool available to any programmer, Regular Expressions.

Testing

  • The easiest way to test regular expressions (PCRE style) is probably to use Perl directly, like this:
echo "input" | perl -n -e '/regexpHere/ and print'
  • To print a match (and not the whole line), use:
echo "input" | perl -n -e '/(regexpHere)/ and print $1'
  • You can also use grep, but it does not support PCRE syntax by default (GNU grep enables it with the -P switch).

Examples

  • To find a <li> followed by a <ul> without a </li> element between (general note: you should very rarely use regular expressions to parse HTML):
 /(?s)<li((?!<\/li).)*?<ul>/
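
As a minimal sketch of how this pattern behaves, the snippet below runs it through Python's re module. The two sample strings are made up for illustration, and the surrounding slashes are dropped because Python patterns are plain strings.

```python
import re

# Same pattern as above without the / / delimiters; "/" needs no escaping in Python.
pattern = re.compile(r"(?s)<li((?!</li).)*?<ul>")

# Hypothetical inputs: a nested list, and two separate lists.
nested = "<ul><li>Fruit\n<ul><li>Apple</li></ul></li></ul>"
flat = "<ul><li>Fruit</li></ul>\n<ul><li>Apple</li></ul>"

# The <li> is still open when the inner <ul> starts, so this matches.
nested_match = pattern.search(nested)

# Every <li> is closed before the next <ul>, so the lookahead blocks the match.
flat_match = pattern.search(flat)

print(repr(nested_match.group(0)))
print(flat_match)
```

The negative lookahead is checked before each character is consumed, which is what stops the match from running past a closing </li>.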

Basics

Operators

  • ^ matches at the start of the string to search and, in multiline mode, also at the start of each line (just after a newline). It is optional. Note that if you use syntax such as '^.*', you can usually remove the ^ altogether: it is better not to use ^ in this case.
  • $ matches at the end of the string and, in multiline mode, also at the end of each line (just before a newline). It is optional.
  • *? is the non-greedy version of *: it will match as little text as possible. +?, ?? and {m,n}? are also available.
  • ? makes the preceding expression optional (zero or one occurrence). It should be put after the expression, e.g. ab? will match either a or ab.
  • {n} asks for exactly n repetitions of the preceding expression. Because of this, the { and } symbols can be considered special and should be escaped when meant literally. However, if the regular expression engine does not detect an integer inside the braces, it may treat them as literals.
  • (?!...) is a negative lookahead: it succeeds only if the enclosed expression does not match at the current position, and it consumes no characters. It is generally hard to use.
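
The operators above can be tried out with a short Python sketch. All sample strings are invented for illustration, and the behaviour shown for a literal {name} is that of Python's re module; other engines may differ.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy * grabs as much as possible; the non-greedy *? stops at the first </b>.
greedy = re.search(r"<b>.*</b>", text).group(0)
lazy = re.search(r"<b>.*?</b>", text).group(0)

# ? makes the preceding "b" optional.
optional = re.findall(r"ab?", "a ab abb")

# {3} asks for exactly three digits in a row.
repeated = re.findall(r"\d{3}", "12 345 6789")

# Braces without an integer inside are taken literally by Python's re module.
literal_braces = re.search(r"{name}", "Hello {name}").group(0)

# ^ and $ only match at line boundaries when multiline mode is on.
lines = "first\nsecond"
without_flag = re.findall(r"^\w+$", lines)
with_flag = re.findall(r"^\w+$", lines, re.MULTILINE)

print(greedy, lazy, optional, repeated, literal_braces, without_flag, with_flag)
```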

Flags

  • Normally the "." special character matches any character except newlines. You can make it match newlines as well with the DOTALL flag. Inside a regexp, this flag is activated via (?s).
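
A small Python sketch of the DOTALL flag, using a made-up two-line string. In Python the flag can be passed as re.DOTALL or written inline as (?s).

```python
import re

text = "start\nend"  # hypothetical input containing a newline

# By default "." does not match the newline, so there is no match.
default_dot = re.search(r"start.end", text)

# With DOTALL, passed as an argument or inline, "." also matches "\n".
flag_argument = re.search(r"start.end", text, re.DOTALL)
inline_flag = re.search(r"(?s)start.end", text)

print(default_dot, flag_argument is not None, inline_flag is not None)
```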

Groups

  • The special group 0 corresponds to the entire match.
  • In a regular expression, we often want to extract a particular piece of information from a string. We need to enclose the relevant "sub expression" in parentheses. In Python, we can then refer to this group by its number, or, if we add ?P<name> to the group, by its name. To create a group which will not be available later for retrieval, write (?:expression).
  • Example:
    • regularExpression = re.compile(r"&price=(?P<Price>.*)&quantity=(?P<Quantity>.*)")
    • matchObject = regularExpression.search(query_string)
    • We could access the price value by matchObject.group(1) or matchObject.group('Price'). For the quantity, it would be group(2) or group('Quantity').
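
For reference, here is a runnable version of the example, as a sketch only: the query string is a made-up value, and the last lines add a non-capturing group to show that it does not appear among the results.

```python
import re

query_string = "item=42&price=19.99&quantity=3"  # hypothetical input

regular_expression = re.compile(r"&price=(?P<Price>.*)&quantity=(?P<Quantity>.*)")
match_object = regular_expression.search(query_string)

whole = match_object.group(0)        # group 0: everything the pattern matched
price = match_object.group("Price")  # same value as match_object.group(1)
quantity = match_object.group(2)     # same value as match_object.group("Quantity")

# (?:...) groups without capturing, so only the host name is retrievable.
non_capturing = re.search(r"(?:https?)://(\w+)", "http://example")

print(whole, price, quantity, non_capturing.groups())
```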

Quoting part of a regular expression

  • This can be done via the \Q and \E constructs (you nest the literal inside those). They are supported by Perl, PCRE and Java, but not by JavaScript or by Python's re module.
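
In engines without \Q and \E, the usual substitute is a helper that escapes the literal before it is embedded in the pattern. The sketch below uses Python's re.escape for this; the sample strings are invented for illustration.

```python
import re

literal = "1+1=2 (really?)"  # hypothetical text full of special characters

# Used as-is, "+", "(", ")" and "?" are interpreted as operators.
unquoted = re.search(literal, literal)

# re.escape() backslash-escapes the special characters so the text matches literally.
quoted = re.compile(re.escape(literal) + r"\s*!")
quoted_match = quoted.search("He said 1+1=2 (really?) !")

print(unquoted, quoted_match.group(0))
```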

Regular expressions in Java

  • See the documentation for the Pattern (http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html) and Matcher (http://java.sun.com/javase/6/docs/api/java/util/regex/Matcher.html) classes. Basically, you create a Pattern from a string containing a regular expression; then you call matcher() on the Pattern object, giving the input string as argument. The resulting Matcher object can be used for the standard operations.
  • If you use a string to create a regular expression, you need to double the backslash when escaping special characters, e.g. \\*. This is because the first backslash is needed to escape the backslash itself in a Java string literal.
  • The matches() method of the Matcher class requires the whole input to match, so it is closer to fullmatch() than to match() in Python; lookingAt() is the equivalent of match(), and find() is similar to search().
  • Pattern.quote() is a static method that will produce a literal regular expression out of a string. This is useful for literally matching strings that include characters such as "*".

Manipulating results

  • You call the group(groupNumber) method on a matcher to retrieve the value of a group match.
  • There is no method for directly replacing the contents of a group submatch, but it is easy to write code such as:
value = value.substring(0, matcher.start(1)) + "#" + value.substring(matcher.end(1));
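
The same idea can be sketched in Python, whose match objects also expose start(group) and end(group). The input string and pattern below are hypothetical and only serve to show the slicing.

```python
import re

value = "user=alice;token=secret123;lang=en"  # hypothetical input

match = re.search(r"token=(\w+)", value)
if match:
    # Keep everything before and after group 1, and put "#" in its place.
    value = value[:match.start(1)] + "#" + value[match.end(1):]

print(value)
```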

Warnings

  • Be careful that a matcher is bound to the whole content (string) it was created with. Assigning a new string to the original variable later does not change the input of the matcher (call reset() with the new input for that), which can lead to subtle bugs.

Regular expressions in Groovy

  • Groovy has the following shortcuts:
    • ==~ for matches().
    • =~ for creating a matcher. The matcher is coerced to a Boolean via its find() method, so you can write:
if ("hello" =~ /hel/)

Be careful to include parentheses when negating the result:

if (! ("hello" =~ /hal/))
  • A pattern can be directly created via ~/foo/.

Escaping

  • If you use / / to create a regular expression, you need to escape the slashes. All escaping is done through a single backslash (rather than 2 backslashes when using a String). For instance:
/(url\(\"http:\/\/(.*?))\"/
  • List of characters to escape:
    • double quotes,
    • slashes,
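    • parentheses.
  • If you use a regular expression in String mode, and you wish to match a literal $, you must use three backslashes (two for the backslash that the regular expression needs, and one to stop Groovy from interpolating the $):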
myString.replaceAll("\\\$\\{variable\\}", "something")
  • It is NOT necessary to escape single quotes.

Regular expressions in JavaScript

  • Be careful that JavaScript has traditionally had no DOTALL mode (the s flag was only added in ECMAScript 2018). \n will match a newline, independently of the OS representation, on Firefox but NOT on Opera (at least; maybe also on IE). So if you need to match any character, the best is to use:
[\s\S]

rather than . with DOTALL mode. Do not use (.|\n) and do not use (.|[\s\S])! The second may seem to work but will actually freeze the JS execution engine on complex matches.

  • You can use the following statement to quote a literal string for use in a regular expression (same as using \Q and \E in Java):
regularExpressionLiteral = regularExpressionLiteral.replace(new RegExp("[.\\\\+*?\\[\\^\\]$(){}=!<>|:\\-]", "g"), "\\$&");
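
The [\s\S] idiom recommended above is not specific to JavaScript. As a quick check, the sketch below uses Python (chosen only for illustration, with a made-up input) to show that the character class matches across newlines without any DOTALL flag.

```python
import re

text = "<p>line one\nline two</p>"  # hypothetical input spanning two lines

# Without DOTALL, "." stops at the newline and the pattern fails.
with_dot = re.search(r"<p>(.*)</p>", text)

# \s and \S together cover every character, newlines included.
with_class = re.search(r"<p>([\s\S]*)</p>", text)

print(with_dot, repr(with_class.group(1)))
```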

Regular expressions in Python

Official documentation is available at http://docs.python.org/lib/module-re.html.

Basic Operations

You can perform two basic operations: search and match. In Perl, search is always used.

  • match(): Determine if the RE matches at the beginning of the string.
  • search(): Scan through a string, looking for any location where this RE matches.
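
A minimal sketch of the difference, using an invented input string. fullmatch() is included as well, since it is the variant that requires the whole string to match.

```python
import re

pattern = re.compile(r"\d+")
text = "order 66 shipped"  # hypothetical input

# match() only tries at the beginning of the string, which is not a digit here.
at_start = pattern.match(text)

# search() scans forward until it finds a location where the pattern matches.
anywhere = pattern.search(text).group(0)

# match() succeeds once the digits are at the start, even with trailing text.
prefix = pattern.match("66 shipped").group(0)

# fullmatch() additionally requires the match to reach the end of the string.
whole_fail = pattern.fullmatch("66 shipped")
whole_ok = pattern.fullmatch("66").group(0)

print(at_start, anywhere, prefix, whole_fail, whole_ok)
```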

Regular Expressions in PHP

Things worth keeping in mind:

  • It is better to write your patterns in single-quoted strings, as PHP then won't interpolate variables starting with $. Note that if you use a double-quoted string, just typing a backslash (\) character in front of the $ stops the variable expansion but WON'T match a literal $ - the regular expression engine then receives a bare $, which is already a special character within regular expressions.
  • The preg_ functions always require a delimiter around the pattern - typically you can just use "/" at the start and the end of the pattern.

Functions

  • Use preg_match if you want to get some data inside a string via a regular expression. You can access the groups you created via $matches[1], $matches[2], etc. ($matches[0] holds the entire match).
preg_match($pattern, $data, $matches);
  • Use preg_replace if you want to replace some data inside a string.