Perl Information Center Tutorials - Regular Expressions
These tutorials were written to help you get a quick, but thorough, understanding of Perl -
the scope of the language as well as its specific capabilities.
Regular Expressions - Quick Definition
A regular expression is simply a string of characters which specifies
a pattern to use in searching other strings. For example, the strings
"abc" and "a.c" are both valid regular expressions. The first is simply the
exact literal search string "abc" whereas the second is a pattern that describes a
3-character search string starting with a and ending with c. The period is one
of many wildcards (special characters), in this case meaning any single character.
Windows users should already be familiar with the wildcard concept, where
the asterisk "*" and question mark "?" are used as wildcards in strings
to form search patterns. For example, in a Windows Search, the string
"*.txt" searches for any text file. Or, the string "file?.txt"
searches for any file with the name file1.txt, file2.txt, etc.
Regular expressions are simply strings which contain wildcards. The difficulty
- and the power - is that Perl has a very long list of wildcards and special
rules on how to use them.
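As a minimal sketch of the difference between the literal pattern "abc" and the wildcard pattern "a.c", the following tests both against a few sample strings. The sample strings and variable names are invented for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my @candidates = ('abc', 'axc', 'ac', 'xxabcxx');

my (@literal_hits, @wildcard_hits);
for my $text (@candidates) {
    push @literal_hits,  $text if $text =~ /abc/;   # exact literal search
    push @wildcard_hits, $text if $text =~ /a.c/;   # a, any one character, c
}

print "literal:  @literal_hits\n";    # literal:  abc xxabcxx
print "wildcard: @wildcard_hits\n";   # wildcard: abc axc xxabcxx
```

Note that "ac" matches neither pattern: the period stands for exactly one character, not zero.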
String Operators
Regular expressions by themselves are just strings. It's when regular
expressions are used in Perl string operators that the magic begins. So before
getting to examples of regular expressions, we need to go over the two
basic operators which can use regular expressions -
m// (matching), s/// (search and replace)
These operators work just fine using exact literal strings instead of regular
expressions, but they are far more useful when used in conjunction with regular
expressions.
Matching Operator - m//
The matching operator returns a true/false value, depending on
whether a substring is found in a larger string. The substring
is placed within the slashes. The larger string to be searched
is, by default, the variable $_. To use a matching operator,
simply place it wherever any other expression would be used,
such as the condition of an if statement, as in the following three examples.
Note that the 'm' is optional and most Perl programmers do not
use it. The following examples leave off the m.
if (/abc/) {print 'true'} # looks for abc in default variable $_
if (/a$k/) {print 'true'} # m// interpolates $k, just like double quotes
if ($a=~/abc/) {print 'true'} # looks for abc in variable $a
In the 3rd example, the =~ operator is used to tell Perl to look
for the pattern in a string other than the default variable $_. The =~
operator binds the search to the specified variable - it is not an equality!
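The three forms of matching described above can be sketched as a small runnable script. The variable names $line and $k and the sample strings are assumptions made for this example.

```perl
use strict;
use warnings;

# $line and $k are hypothetical names used only for this sketch.
my $line = 'the alphabet starts with abc';
my $k    = 'b';

# Bound match: =~ makes the pattern search $line rather than $_.
my $found_literal = ($line =~ /abc/) ? 1 : 0;

# $k is interpolated, so this pattern is /abc/ at run time.
# The braces stop Perl from reading the variable name as $kc.
my $found_interp = ($line =~ /a${k}c/) ? 1 : 0;

# Default match: with no =~, the pattern is applied to $_,
# which the for loop sets to each list element in turn.
my $found_default = 0;
for ('xyz', 'abc') {
    $found_default++ if /abc/;
}

print "$found_literal $found_interp $found_default\n";   # 1 1 1
```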
Substitution Operator - s///
The substitution operator replaces one string with another.
The string to be searched is, by default, the variable $_.
s/abc/def/; # abc is replaced with def in the variable $_
s/a${k}c/def/; # s/// interpolates $k, just like double quotes (braces keep it from being read as $kc)
$a=~s/abc/def/; # abc is replaced with def in the variable $a
In the 3rd example, the =~ operator is used to tell Perl to apply
the substitution to a string other than the default variable $_. The =~
operator binds the operation to the specified variable - it is not an equality!
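Here is a short sketch of the substitution operator in use, assuming a sample string invented for the purpose. It also shows two details not spelled out above: without the g modifier only the first occurrence is replaced, and s/// returns the number of replacements it made.

```perl
use strict;
use warnings;

my $text = 'abc and abc';   # hypothetical sample string

# Without /g only the first occurrence is replaced.
(my $first_only = $text) =~ s/abc/def/;

# With /g every occurrence is replaced; s/// returns the replacement count.
my $all   = $text;
my $count = ($all =~ s/abc/def/g);

print "$first_only\n";                   # def and abc
print "$all ($count replacements)\n";    # def and def (2 replacements)
```

Copying the string before substituting, as done here, leaves the original $text untouched.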
RegEx Strings
In all of the examples above exact strings (including interpolated variables)
were used to define the search pattern. Although exact strings are considered
regular expressions, the real power of regular expressions comes by adding
wildcards (special characters) to the strings.
Regular expressions recognize special characters which can define the pattern
in four ways:
define a character type
define a range of characters
define the position of a character and its adjacent characters
define the number of occurrences of a character
Here are the special characters for each of the four ways in which a
pattern may be defined.
- Define character type
. a single character (except \n)
\s a whitespace character (space, tab, newline)
\S non-whitespace character
\d a digit (0-9)
\D a non-digit
\w a word character (a-z, A-Z, 0-9, _)
\W a non-word character
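The character types above can be tried out with a sketch like the following. The sample string is invented, and the example relies on one behavior not covered above: in list context, a match with the g modifier returns every match found.

```perl
use strict;
use warnings;

my $sample = "Room 42\tis_open";   # hypothetical sample; \t is a tab

my @digits = $sample =~ /\d/g;    # every digit
my @spaces = $sample =~ /\s/g;    # every whitespace character
my @words  = $sample =~ /\w+/g;   # every run of word characters

print "digits: @digits\n";                     # digits: 4 2
print "whitespace count: ", scalar(@spaces), "\n";   # whitespace count: 2
print "words: ", join('|', @words), "\n";      # words: Room|42|is_open
```

The underscore counts as a word character, which is why "is_open" comes back as a single word.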
- Define range/list
Square brackets are used to define a single character which matches
any character within the square brackets, as in the following examples.
[0-9] any single digit 0 through 9
[a-z] any single lower case letter a through z
[0-9a-zA-Z] any single digit or letter
[^0-9a-zA-Z] not digit or letter (as the first character inside brackets, ^ is negation)
[ ] single space
[ \t] single space or tab
[0-9][a-z] single digit followed by single lower case letter
[aeiou] any vowel
[^aeiou] any single character outside the given set
(foo|bar|baz) matches any of the alternatives specified
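A few of the bracket and alternation patterns above, sketched against invented inputs. Each variable holds 1 if the pattern matched and 0 if it did not.

```perl
use strict;
use warnings;

# Each line tests one hypothetical input against one pattern from the list.
my $digit_then_letter = ('7x'  =~ /[0-9][a-z]/)    ? 1 : 0;  # digit, then letter
my $letter_then_digit = ('x7'  =~ /[0-9][a-z]/)    ? 1 : 0;  # wrong order
my $has_non_vowel     = ('aei' =~ /[^aeiou]/)      ? 1 : 0;  # only vowels present
my $is_alternative    = ('bar' =~ /(foo|bar|baz)/) ? 1 : 0;  # one of the alternatives

print "$digit_then_letter $letter_then_digit $has_non_vowel $is_alternative\n";   # 1 0 0 1
```

The third test fails because [^aeiou] still has to match one character, and "aei" contains nothing outside the set.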
- Define position or adjacent characters
Regular expressions allow the location of a matching pattern to be
defined, as well as placing limits on adjacent characters.
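^ refers to the start of the string (or of each line when the m modifier is used)
$ refers to the end of the string, or just before a final newline (or the end of each line when the m modifier is used)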
\b refers to a word boundary
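The position characters can be sketched as follows, using an invented sample string. The "= () =" idiom used to count matches is a common Perl technique and is not part of the tutorial text above.

```perl
use strict;
use warnings;

my $text = 'document the doc';   # hypothetical sample string

my $starts_with_doc = ($text =~ /^doc/) ? 1 : 0;   # doc at the very start
my $ends_with_doc   = ($text =~ /doc$/) ? 1 : 0;   # doc at the very end

# Count matches by assigning the list of matches to an empty list.
my $word_starts = () = $text =~ /\bdoc/g;     # words beginning with doc
my $whole_words = () = $text =~ /\bdoc\b/g;   # doc as a complete word

print "$starts_with_doc $ends_with_doc $word_starts $whole_words\n";   # 1 1 2 1
```

Both "document" and "doc" begin with doc, but only the final word is doc on its own, since \b requires a boundary on each side where it appears.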
- Define number of occurrences
Regular expressions provide the following shortcuts to defining
how many of a character are in a pattern. Exact quantities and
ranges are both supported.
* zero or more of the previous thing
+ one or more of the previous thing
? zero or one of the previous thing
{3} matches exactly 3 of the previous thing
{3,6} matches between 3 and 6 of the previous thing
{3,} matches 3 or more of the previous thing
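As a sketch of the occurrence counts above, the following checks a handful of invented strings. The patterns are anchored with ^ and $ so that the whole string must satisfy the count, not just part of it.

```perl
use strict;
use warnings;

# Which hypothetical inputs consist of exactly 3 to 6 digits?
my %ok;
for my $s ('12', '123', '123456', '1234567') {
    $ok{$s} = ($s =~ /^\d{3,6}$/) ? 1 : 0;
}

# ? makes the u optional: zero or one occurrence.
my $optional = join ',', grep { /^colou?r$/ } ('color', 'colour', 'colouur');

print join(' ', map { "$_=$ok{$_}" } sort keys %ok), "\n";
print "$optional\n";   # color,colour
```

Without the anchors, /\d{3,6}/ would also succeed on "1234567", because six of its seven digits are enough to satisfy the pattern.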
Automatic Special Variable Assignment
After each successful match, Perl sets several of its special
variables. These automatic assignments make information
available to the program with no effort on the part of the
programmer.
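@- offsets of the start of the last match ($-[0]) and of each ()-enclosed group ($-[1], $-[2]..)
@+ offsets of the end of the last match ($+[0]) and of each ()-enclosed group ($+[1], $+[2]..)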
$& string last matched
$` string preceding last match
$' string following last match
$+ text matched by last ()-enclosed pattern
$1, $2.. text matched by each ()-enclosed pattern
Regarding the last entry, when one or more pairs of parentheses are used
to group portions of a pattern, the text matched by each enclosed
portion of the pattern will be placed by Perl in $1, $2, and so on,
numbered from left to right.
/(a.)(b$)/ $1 holds match of (a.), $2 holds match of (b$)
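The special variables can be inspected with a sketch like this one. The sample string and the variable names that receive the values are invented for illustration. The values are copied right away because the next successful match overwrites them.

```perl
use strict;
use warnings;

my $text = 'say key=value now';   # hypothetical sample string
my ($name, $value, $matched, $before, $after, $start, $end);

if ($text =~ /(\w+)=(\w+)/) {
    ($name, $value)   = ($1, $2);         # text of each parenthesized group
    $matched          = $&;               # the whole matched text
    ($before, $after) = ($`, $');         # text on either side of the match
    ($start, $end)    = ($-[0], $+[0]);   # offsets of the whole match
}

print "name=$name value=$value\n";        # name=key value=value
print "matched '$matched' at $start..$end\n";   # matched 'key=value' at 4..13
print "before '$before', after '$after'\n";     # before 'say ', after ' now'
```

Testing the match in an if statement matters here: after a failed match the variables still hold the values from an earlier successful match.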
Pattern Examples
Here are several example regular expressions which combine the use of
the special character types discussed above. All of these search
on the content of the special variable $_. Matches are listed first,
followed by substitutions.
/^$/ empty line (start and end are adjacent)
/^a/ starts with the letter a
/^\d/ starts with any digit
/^\d+/ starts with one or more digits
/a$/ ends with the letter a
/[ ]+/ one or more spaces in a sequence
/^[ ]*$/ line with zero or more spaces and nothing else
/^[^0-9a-zA-Z \t]/ line beginning with a character that is not a digit, letter, space, or tab
/[0-9]{3,5}/ 3 to 5 digits in a sequence
/\bdoc/ word beginning with doc
/ment\b/ word ending with ment
/(\d\s){3}/ 3 digits, each followed by whitespace (eg "3 4 5 ")
/(a.)+/ one or more consecutive pairs of a followed by any character (eg "abacadaf")
s/\n//g remove newlines
s/\W//g remove all non-word characters (anything except letters, digits, and _)
s/<([^>]|\n)*>//g strip HTML tags from a string
s/<[^>]+>//g remove all HTML tags (simpler form of the above)
s/^\s+// strip leading whitespace from a string
s/\s+$// strip trailing whitespace from a string
s/^.*\/// remove leading path from a filename
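Several of the substitutions above can be sketched on invented sample data. Each one works on a copy bound with =~ rather than on $_, so the original strings are left alone.

```perl
use strict;
use warnings;

# Hypothetical sample values, chosen only for illustration.
my $path   = '/usr/local/lib/notes.txt';
my $padded = "  hello world \t\n";
my $html   = '<p>Hello <b>world</b></p>';

# Remove the leading path: .* is greedy, so it runs to the last slash.
(my $file = $path) =~ s/^.*\///;

# Strip leading, then trailing, whitespace.
(my $trimmed = $padded) =~ s/^\s+//;
$trimmed =~ s/\s+$//;

# Remove every tag.
(my $plain = $html) =~ s/<[^>]+>//g;

print "$file\n";       # notes.txt
print "[$trimmed]\n";  # [hello world]
print "$plain\n";      # Hello world
```

The tag-stripping pattern is only a rough tool: it assumes every < begins a tag and every > ends one, which real HTML does not guarantee.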
If you have any suggestions or corrections, please let me know.