Regular Expressions
Latest revision as of 20:49, 4 December 2011
This article is an introduction to an extremely powerful tool available to any programmer, Regular Expressions.
Testing
- The easiest way to test regular expressions (PCRE style) is probably to use Perl directly, like this:
echo "input" | perl -n -e '/regexpHere/ and print'
- To print a match (and not the whole line), use:
echo "input" | perl -n -e '/(regexpHere)/ and print $1'
- You can also use grep, but it does not support PCRE syntax by default (GNU grep has the -P switch, which enables it).
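If Perl is not at hand, the two one-liners above can be approximated in Python. This is only a sketch: the helper names matching_lines and first_groups are made up for illustration, and Python's re syntax is close to, but not identical to, PCRE.

```python
import re

def matching_lines(pattern, text):
    """Return the lines that contain a match (like: perl -n -e '/.../ and print')."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]

def first_groups(pattern, text):
    """Return group 1 of the first match on each matching line (like: print $1)."""
    regex = re.compile(pattern)
    matches = (regex.search(line) for line in text.splitlines())
    return [m.group(1) for m in matches if m]

print(matching_lines(r"in.ut", "input\nother"))
print(first_groups(r"(p.t)", "input\nnope"))
```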
Examples
- To find a <li> followed by a <ul> without a </li> element between (general note: you should very rarely use regular expressions to parse HTML):
/(?s)<li((?!<\/li).)*?<ul>/
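To see what the pattern above does, here is a small Python check. The two sample strings are invented for illustration; the pattern is the one from the example, minus the escaping of "/" that Python does not need.

```python
import re

# Same pattern as above; (?s) lets "." cross newlines, and the negative
# lookahead stops the scan as soon as a closing </li is reached.
LI_THEN_UL = re.compile(r"(?s)<li((?!</li).)*?<ul>")

nested = "<li>item\n<ul><li>sub</li></ul></li>"   # <ul> opens inside the <li>
closed = "<li>item</li>\n<ul><li>sub</li></ul>"   # </li> comes before the <ul>

nested_match = LI_THEN_UL.search(nested)
closed_match = LI_THEN_UL.search(closed)

print(nested_match.group(0) if nested_match else None)
print(closed_match)
```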
Basics
Operators
- ^ matches at the start of the string to search, or at the start of each line in multiline mode. It is optional. Note that if a pattern starts with '^.*' and you only need to know whether it matches, you can usually remove that prefix altogether: it's better just not to use ^ in this case.
- $ matches at the end of the string, or just before each newline in multiline mode. It is optional.
- *? is the non-greedy operator: it will match as little text as possible. +?, ??, and {m,n}? are also available.
- ? makes the preceding expression optional (zero or one occurrence). It should be put after the expression, e.g. ab? will match either a or ab.
- {n} asks for exactly n repetitions. Because of this, the { and } symbols can be considered special and should be escaped when you mean them literally. However, if the regular expression engine does not detect an integer inside, it may treat the braces as literals.
- (?!...) is a negative lookahead: it succeeds only if ... does not match at the current position, and it consumes no characters. It is generally hard to use correctly.
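The operators above can be tried out with a few short calls. This is a sketch using Python's re module; the sample strings are invented, and other engines may differ in details.

```python
import re

text = "one\ntwo"

# ^ and $ anchor to the whole string by default, to each line with re.MULTILINE.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
ends_default = re.findall(r"\w+$", text)
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)

# Greedy .* takes as much as it can; non-greedy .*? takes as little as it can.
greedy = re.findall(r"<.*>", "<a><b>")
lazy = re.findall(r"<.*?>", "<a><b>")

# ab? means "a, optionally followed by one b".
optional = [bool(re.fullmatch(r"ab?", s)) for s in ("a", "ab", "abb")]

# {3} means exactly three repetitions; escaped braces are plain characters.
three = bool(re.fullmatch(r"a{3}", "aaa"))
two = bool(re.fullmatch(r"a{3}", "aa"))
literal_braces = bool(re.search(r"a\{3\}", "a{3}"))

# foo(?!bar) matches "foo" only where it is not followed by "bar".
lookahead_positions = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]
```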
Flags
- Normally the "." special character matches any character except newlines. You can make it match newlines as well with the DOTALL flag. Inside a regexp, this flag is activated via (?s).
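A minimal Python sketch of the DOTALL behaviour described above, with an invented two-line input. It shows both the inline (?s) form and the equivalent re.DOTALL flag.

```python
import re

text = "start\nend"

plain = re.search(r"start.end", text)               # "." stops at the newline
inline = re.search(r"(?s)start.end", text)          # (?s) turns DOTALL on
flagged = re.search(r"start.end", text, re.DOTALL)  # same thing, as a flag

print(plain, inline is not None, flagged is not None)
```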
Groups
- The special group 0 corresponds to the entire match.
- In a regular expression, we often want to extract a particular piece of information from a string. To do so, we enclose the relevant "sub expression" in parentheses. In Python, we can then refer to this group by its number, or, if we add ?P<name> to the group, by its name. To create a group which will not be available later for retrieval, write (?:expression).
- Example:
- regularExpression = re.compile(r"&price=(?P<Price>.*)&quantity=(?P<Quantity>.*)")
- matchObject = regularExpression.search(query_string)
- We could access the price value by matchObject.group(1) or matchObject.group('Price'). For the quantity, it would be group(2) or group('Quantity').
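The example above can be run as follows. The query string is a made-up value chosen for illustration, and the last lines add a small non-capturing group to show that it does not get a number.

```python
import re

query_string = "id=7&price=12.50&quantity=3"  # hypothetical input

regular_expression = re.compile(r"&price=(?P<Price>.*)&quantity=(?P<Quantity>.*)")
match_object = regular_expression.search(query_string)

whole = match_object.group(0)               # group 0 is the entire match
price_by_number = match_object.group(1)
price_by_name = match_object.group("Price")
quantity = match_object.group("Quantity")

# (?:...) groups without capturing, so (c) is the only numbered group here.
non_capturing = re.search(r"(?:ab)+(c)", "ababc")
captured = non_capturing.groups()
```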
Quoting part of a regular expression
- This can be done via the \Q and \E constructs (you put the literal text between them). This works in Perl, PCRE and Java; it is not available in JavaScript or Python.
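Python has no \Q and \E, but its re.escape() function serves the same purpose. A minimal sketch, where contains_literal is a helper name invented for this example:

```python
import re

def contains_literal(needle, haystack):
    """Search for needle as plain text, even if it contains regex metacharacters."""
    return re.search(re.escape(needle), haystack) is not None

# Unescaped, "a*b" means "any number of a, then b" and would match "aab".
unescaped_hit = re.search("a*b", "aab") is not None
escaped_hit = contains_literal("a*b", "aab")
literal_hit = contains_literal("a*b", "xa*by")
```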
Regular expressions in Java
- Documentation is available for the Pattern (http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html) and Matcher (http://java.sun.com/javase/6/docs/api/java/util/regex/Matcher.html) classes. Basically, you create a Pattern from a string containing a regular expression, then you call matcher() on the Pattern object, passing the input string as the argument. The resulting Matcher object can be used for the standard operations.
- If you write a regular expression as a Java string literal, you need to double the backslash when escaping special characters, e.g. \\*. This is because the first backslash escapes the second one in the string literal, so that the regular expression engine receives \*.
- The matches() method of the Matcher class requires the whole input to match, like fullmatch() in Python; lookingAt() is similar to match() in Python; find() is similar to search().
- Pattern.quote() is a static method that will produce a literal regular expression out of a string. This is useful for matching strings literally when they include characters such as "*".
Manipulating results
- You call the group(groupNumber) method on a matcher to retrieve the value of a group match.
- There is no method for directly replacing the contents of a group submatch, but it is easy to write code such as:
value = value.substring(0, matcher.start(1)) + "#" + value.substring(matcher.end(1));
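For comparison, the same slicing technique can be sketched in Python, whose match objects also expose start() and end() per group. The helper name replace_group and the sample value are invented for illustration.

```python
import re

def replace_group(match, group, replacement):
    """Return the searched string with the text of one group replaced.

    Assumes the group took part in the match (otherwise start() is -1).
    """
    value = match.string
    return value[:match.start(group)] + replacement + value[match.end(group):]

value = "user=alice;token=secret"        # hypothetical input
match = re.search(r"token=(\w+)", value)
masked = replace_group(match, 1, "#")
print(masked)
```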
Warnings
- Be careful: once a matcher is created, it is bound to the whole content (string) it was created with. Strings are immutable in Java, so assigning a new value to the original variable later won't change the content seen by the matcher; call reset() with the new input or create a new matcher. Forgetting this can lead to subtle bugs.
Regular expressions in Groovy
- Groovy has the following shortcuts:
- ==~ for matches().
- =~ for creating a matcher. The matcher is coerced to a Boolean via its find() method, so you can write things like:
if ("hello" =~ /hel/)
Be careful to include parentheses in the following case:
if (! ("hello" =~ /hal/))
- A pattern can be directly created via ~/foo/.
Escaping
- If you use / / to create a regular expression, you need to escape the slashes. All escaping is done through a single backslash (rather than 2 backslashes when using a String). For instance:
/(url\(\"http:\/\/(.*?))\"/
- List of characters to escape:
- double quotes,
- slashes,
- parentheses.
- If you use a regular expression in String mode, and you wish to match a literal $, you must use three backslashes:
myString.replaceAll("\\\$\\{variable\\}", "something")
- It is NOT necessary to escape single quotes.
Regular expressions in JavaScript
- Be careful: older JavaScript engines have no DOTALL mode (the s flag was only added in ES2018). \n will match a newline, independently of the OS representation, on Firefox but NOT on Opera (at least; maybe also on IE). So if you need to match any character, it is best to use:
[\s\S]
rather than . with DOTALL mode. Do not use (.|\n) and do not use (.|[\s\S])! The second may seem to work but will actually freeze the JS execution engine on complex matches.
- You can use the following statement to quote a literal string for use in a regular expression (same as using \Q and \E in Java):
regularExpressionLiteral = regularExpressionLiteral.replace(new RegExp("[.\\\\+*?\\[\\^\\]$(){}=!<>|:\\-]", "g"), "\\$&");
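The [\s\S] idiom recommended above is not specific to JavaScript. A short sketch in Python, with an invented input, shows it matching across a newline without any flag; the backtracking problem mentioned above is not demonstrated here.

```python
import re

text = "<b>bold\ntext</b>"

# "." does not cross the newline, so this search fails without DOTALL.
with_dot = re.search(r"<b>(.*?)</b>", text)

# [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
with_class = re.search(r"<b>([\s\S]*?)</b>", text)

print(with_dot, with_class.group(1) if with_class else None)
```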
Regular expressions in Python
Official documentation available here: http://docs.python.org/lib/module-re.html
Basic Operations
You can perform two basic operations: search and match. Perl always searches.
- match(): Determine if the RE matches at the beginning of the string.
- search(): Scan through a string, looking for any location where this RE matches.
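The difference between the two operations can be seen in a short sketch with invented sample strings. fullmatch() is included as well, since it is the variant that requires the whole string to match.

```python
import re

number = re.compile(r"\d+")

at_start = number.match("42 apples")      # matches: digits at the beginning
not_at_start = number.match("apples 42")  # None: match() only looks at the start
anywhere = number.search("apples 42")     # matches: search() scans the string
partial = number.fullmatch("42 apples")   # None: the whole string must match
entire = number.fullmatch("42")           # matches

print(at_start.group(0), not_at_start, anywhere.group(0), partial, entire.group(0))
```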
Regular Expressions in PHP
Things worth keeping in mind:
- It is better to write your patterns in single-quoted strings, as PHP then won't interpret variables starting with $. Note that if you use double-quoted strings, just typing a backslash (\) character in front of the $ in the hope of matching a literal $ WON'T work: PHP consumes the backslash, and the regular expression engine then sees a bare $, which is already a special character within regular expressions.
- Patterns passed to the preg_ functions must be enclosed in delimiters; typically you can just use "/" at the start and the end of the pattern.
Functions
- Use preg_match if you want to extract some data from a string via a regular expression. $matches[0] holds the whole match, and you can access the groups you created via $matches[1], $matches[2], etc.
preg_match($pattern, $data, $matches);
- Use preg_replace if you want to replace some data inside a string.