Regular Expressions
Revision as of 13:52, 7 December 2010
This article is an introduction to regular expressions, an extremely powerful tool available to any programmer.
Testing
- The easiest way to test regular expressions (PCRE style) is probably to use Perl directly, like this:
echo "input" | perl -n -e '/regexpHere/ and print'
- You can also use grep, but it does not support PCRE syntax by default (GNU grep supports it with the -P switch).
Basics
Operators
- ^ matches at the start of the string to search, or just after a newline in multiline mode. It is optional. Note that if a pattern starts with '^.*', you can usually remove that prefix altogether when searching; it is better not to use ^ in this case.
- $ matches at the end of the string, or just before a newline in multiline mode. It is optional.
- *? is the non-greedy version of *: it matches as little text as possible. +?, ?? and {m,n}? are also available.
- ? makes the preceding expression optional (zero or one occurrence). It goes after the expression, e.g. ab? matches either a or ab.
- {n} asks for exactly n repetitions of the preceding expression. Because of this, the { and } symbols can be treated as special and should be escaped when meant literally. However, if the regular expression engine does not find an integer inside the braces, it may treat them as literals.
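The operators above can be tried out with a short Python sketch. The sample strings and variable names are invented for illustration, and other engines may differ in details.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* grabs as much as possible, so the match runs to the last </b>.
greedy = re.search(r"<b>.*</b>", text).group(0)

# Non-greedy: .*? stops at the first </b> it can reach.
lazy = re.search(r"<b>.*?</b>", text).group(0)

# ? makes the preceding expression optional: ab? matches "a" or "ab".
optional = re.findall(r"ab?", "a ab abb")

# {n} repeats the preceding expression exactly n times.
repeated = re.findall(r"\d{3}", "12 345 6789")

# In multiline mode, ^ and $ also match at the start and end of each line.
lines = re.findall(r"^\w+$", "first\nsecond line\nthird", re.MULTILINE)

print(greedy, lazy, optional, repeated, lines)
```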
Flags
- Normally the "." special character matches any character except newlines. You can make it match newlines as well with the DOTALL flag, which can be turned on inside the pattern via (?s).
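As a minimal sketch of the DOTALL flag in Python (the sample text is made up for illustration), the flag can be set inline with (?s) or passed as re.DOTALL:

```python
import re

text = "start\nmiddle\nend"

# Without DOTALL, "." stops at the first newline.
default = re.search(r"start.*", text).group(0)

# The inline flag (?s) at the start of the pattern turns DOTALL on.
inline = re.search(r"(?s)start.*", text).group(0)

# The same flag can be passed as an argument instead.
by_argument = re.search(r"start.*", text, re.DOTALL).group(0)

print(repr(default), repr(inline), repr(by_argument))
```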
Groups
- The special group 0 corresponds to the entire pattern.
- In a regular expression, we often want to extract a particular piece of information from a string. To do so, we enclose the relevant sub-expression in parentheses. In Python, we can then refer to this group by its number, or, if we add ?P<name> right after the opening parenthesis, by its name. To create a group which will not be available later for retrieval, write (?:expression).
- Example:
- regular_expression = re.compile(r"&price=(?P<Price>.*)&quantity=(?P<Quantity>.*)")
- match_object = regular_expression.search(query_string)
- We could access the price value by match_object.group(1) or match_object.group('Price'). For the quantity, it would be group(2) or group('Quantity').
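The example above can be run end to end with the sketch below. The query string is a hypothetical input chosen for illustration, and the match result is named match_object here.

```python
import re

# Hypothetical input; any string containing both parameters would do.
query_string = "item=42&price=9.99&quantity=3"

regular_expression = re.compile(r"&price=(?P<Price>.*)&quantity=(?P<Quantity>.*)")
match_object = regular_expression.search(query_string)

# Group 0 is the whole match; the other groups work by number or by name.
whole = match_object.group(0)
price = match_object.group(1)
price_by_name = match_object.group("Price")
quantity = match_object.group("Quantity")

# (?:...) groups without capturing, so only the digits are retrievable.
non_capturing = re.search(r"(?:price|cost)=(\d+)", "cost=12").groups()

print(whole, price, price_by_name, quantity, non_capturing)
```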
Quoting part of a regular expression
- In engines that support them (Perl, PCRE and Java, for example), this can be done via the \Q and \E constructs: everything placed between \Q and \E is treated as literal text.
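Python's re module does not provide \Q and \E; the closest counterpart there is re.escape(), which returns a copy of a string with the special characters escaped. A minimal sketch, with an invented literal and input:

```python
import re

literal = "1+1=2 (really?)"
haystack = "proof: 1+1=2 (really?)"

# Used as-is, the string is read as a pattern: + and ? are operators and
# the parentheses form a group, so the literal text is not found.
unescaped = re.search(literal, haystack)

# re.escape() quotes the special characters so the text matches literally.
escaped = re.search(re.escape(literal), haystack)

print(unescaped, escaped.group(0))
```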
Regular expressions in Java
- See the documentation for the Pattern and Matcher classes. Basically, you create a Pattern from a string containing a regular expression; then you call matcher() on the Pattern object, passing the input string as argument. The resulting Matcher object can be used for the standard operations.
- If you use a string to create a regular expression, you need to double escape special characters, e.g. \\*. This is because the first backslash is needed to escape the backslash itself in Java.
- The matches() method of the Matcher class requires the whole input to match, so it is closer to fullmatch() in Python than to match(); lookingAt() is similar to match(), and find() is similar to search().
- Pattern.quote() is a static method that produces a literal regular expression out of a string. This is useful for matching strings literally when they include characters such as "*".
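To make the Python side of this comparison concrete, here is a small sketch of the three anchoring behaviours (the inputs are invented for illustration). In Java, Matcher.matches() requires the whole input to match, Matcher.lookingAt() anchors only at the start, and Matcher.find() scans the input.

```python
import re

pattern = re.compile(r"\d+")

# match() anchors at the start only (like lookingAt() in Java).
at_start = pattern.match("123abc")
not_at_start = pattern.match("abc123")

# fullmatch() requires the whole string to match (like matches() in Java).
whole_fail = pattern.fullmatch("123abc")
whole_ok = pattern.fullmatch("123")

# search() scans for the first match anywhere (like find() in Java).
anywhere = pattern.search("abc123")

print(at_start, not_at_start, whole_fail, whole_ok, anywhere)
```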
Warnings
- Be careful: a matcher is bound to the content (string) it was created with. Assigning a new value to the variable that held the original content does not change what the matcher works on, which can lead to subtle bugs. Call reset() with the new content, or create a new matcher.
Regular expressions in Groovy
- Groovy has the following shortcuts:
- ==~ for matches().
- =~ for creating a matcher. The matcher is coerced to a Boolean via its find() method, so you can write:
if ("hello" =~ /hel/)
Be careful to include parentheses in the following case, because ! binds more tightly than =~:
if (! ("hello" =~ /hal/))
- A pattern can be directly created via ~/foo/.
Escaping
- If you use / / to create a regular expression, you need to escape any slashes inside it. All escaping is done with a single backslash (rather than the two backslashes needed when using a String). For instance:
/(url\(\"http:\/\/(.*?))\"/
- List of characters to escape:
- double quotes,
- slashes,
- parentheses.
- It is NOT necessary to escape single quotes.
Regular expressions in JavaScript
- Be careful: JavaScript engines that predate ES2018 have no DOTALL mode (ES2018 added the s flag). \n will match a newline, independently of the OS representation, on Firefox but NOT on Opera (at least; maybe also on IE). So if you need to match any character, including newlines, the best is to use:
[\s\S]
rather than . with DOTALL mode. Do not use (.|\n) and do not use (.|[\s\S])! The second may seem to work but will actually freeze the JS execution engine on complex matches.
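The [\s\S] idiom is not specific to JavaScript. A minimal Python sketch, with an invented input, shows it matching across newlines without any flag:

```python
import re

text = "<p>first\nsecond</p>"

# "." does not cross the newline by default, so nothing is found.
with_dot = re.search(r"<p>(.*)</p>", text)

# [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
with_class = re.search(r"<p>([\s\S]*)</p>", text).group(1)

print(with_dot, repr(with_class))
```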
Regular Expressions in PHP
Things worth keeping in mind:
- It is better to write your patterns in single-quoted strings, as PHP then won't interpolate variables starting with $. Note that in a double-quoted string, putting a backslash (\) in front of the $ only stops PHP from expanding the variable; the regular expression engine still receives a bare $, which is a special character in regular expressions.
- Patterns need a delimiter at the start and at the end; typically you can just use "/". If the pattern itself contains the delimiter character, escape it or choose another delimiter such as "#".
Functions
- Use preg_match if you want to get some data inside a string via a regular expression. $matches[0] holds the text matched by the entire pattern; you can access the groups you created via $matches[1], $matches[2], and so on.
preg_match($pattern, $data, $matches);
- Use preg_replace if you want to replace some data inside a string.