
Perl Information Center Tutorials - Regular Expressions
These tutorials were written to help you get a quick but thorough understanding of Perl - the scope of the language as well as its specific capabilities.


Regular Expressions - Quick Definition
A regular expression is simply a string of characters which specifies a pattern to use in searching other strings. For example, the strings "abc" and "a.c" are both valid regular expressions. The first is simply the exact literal search string "abc" whereas the second is a pattern that describes a 3-character search string starting with a and ending with c. The period is one of many wildcards (special characters), in this case meaning any single character.
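
As a minimal sketch of the difference between the two patterns, the following compares "abc" and "a.c" against a few sample strings. The sample strings and variable names are made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my @samples = ('abc', 'axc', 'ac', 'xabcx');

# grep tests each element through the default variable $_.
my @literal_hits  = grep { /abc/ } @samples;   # must contain the exact text "abc"
my @wildcard_hits = grep { /a.c/ } @samples;   # a, then any one character, then c

print "literal:  @literal_hits\n";
print "wildcard: @wildcard_hits\n";
```

The string "ac" matches neither pattern, because the period requires exactly one character between a and c.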

Windows users should already be familiar with the wildcard concept, where the asterisk "*" and question mark "?" are used as wildcards in strings to form search patterns. For example, in a Windows Search, the string "*.txt" searches for any text file. Or, the string "file?.txt" searches for any file with the name file1.txt, file2.txt, etc.

Regular expressions are simply strings which contain wildcards. The difficulty - and the power - is that Perl has a very long list of wildcards and special rules on how to use them.

String Operators
Regular expressions by themselves are just strings. It's when regular expressions are used in Perl string operators that the magic begins. So before getting to examples of regular expressions, we need to go over the two basic operators which can use regular expressions: m// (matching) and s/// (search and replace).

These operators work just fine using exact literal strings instead of regular expressions, but they are far more useful when used in conjunction with regular expressions.

Matching Operator - m//
The matching operator returns a true/false value, depending on whether a substring is found in a larger string. The substring is placed within the slashes. The larger string to be searched is, by default, the variable $_. To use a matching operator, simply place it wherever any other expression would be used, such as the condition of an if statement, as in the following examples.

Note that the 'm' is optional and most Perl programmers do not use it. The following examples leave off the m.

     if (/abc/) {print 'true'}      # looks for abc in default variable $_
     if (/a$k/) {print 'true'}      # m// interpolates $k, just like double quotes
     if ($a=~/abc/) {print 'true'}  # looks for abc in variable $a

In the third example, the =~ operator tells Perl to look for the pattern in a string other than the default variable $_. The =~ operator binds the match to the specified variable; it is not an equality test.
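
The following is a small runnable sketch of the three forms of matching described above. The variable names and strings are assumptions made for illustration; the braces in ${k} simply mark where the variable name ends.

```perl
use strict;
use warnings;

my $k    = 'b';          # hypothetical variable to interpolate
my $text = 'xxabcxx';    # hypothetical string to search

# Match against a named variable with =~ ; the pattern becomes /abc/.
my $bound = ($text =~ /a${k}c/) ? 1 : 0;

# Match against the default variable $_, which the loop sets on each pass.
my $default = 0;
for ('no match here', 'an abc here') {
    $default++ if /abc/;
}

print "bound=$bound default=$default\n";
```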

Substitution Operator - s///
The substitution operator replaces one string with another. The string to be searched is, by default, the variable $_.

     s/abc/def/;       # abc is replaced with def in the variable $_
     s/a${k}c/def/;    # s/// interpolates $k, just like double quotes (braces stop Perl reading $kc)
     $a=~s/abc/def/;   # abc is replaced with def in the variable $a

In the third example, the =~ operator tells Perl to perform the substitution on a string other than the default variable $_. As with matching, the =~ operator binds the operation to the specified variable; it is not an equality test.
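
Here is a short sketch of substitution on both the default variable and a named variable. The strings and variable names are hypothetical. The g modifier, which also appears in the pattern examples later on this page, replaces every match rather than only the first.

```perl
use strict;
use warnings;

my $k = 'b';                          # hypothetical variable to interpolate

$_ = 'abc-abc';
s/a${k}c/def/;                        # acts on $_; replaces only the first match
my $default = $_;

my $text  = 'abc-abc';
my $count = ($text =~ s/abc/def/g);   # replaces every match; returns how many were made

print "$default $text $count\n";
```

Note that s/// changes the variable in place and returns the number of substitutions, not the new string.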

RegEx Strings
In all of the examples above, exact strings (including interpolated variables) were used to define the search pattern. Although exact strings are valid regular expressions, the real power of regular expressions comes from adding wildcards (special characters) to the strings.

Regular expressions recognize special characters which can define the pattern in four ways:

    define a character type
    define a range of characters
    define the position of a character and its adjacent characters
    define the number of occurrences of a character

Here are the special characters for each of the four ways in which a pattern may be defined.

  • Define character type
        .                   a single character (except \n)
        \s                  a whitespace character (space, tab, newline)
        \S                  non-whitespace character
        \d                  a digit (0-9)
        \D                  a non-digit
        \w                  a word character (a-z, A-Z, 0-9, _)
        \W                  a non-word character
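
As a rough illustration of the character types above, this sketch counts how many characters of each type appear in one sample line. The input string and variable names are assumptions for the example.

```perl
use strict;
use warnings;

# Hypothetical input: two words, a number, and a tab character.
my $line = "Room 42\tis open";

# In list context with the g modifier, a match returns every matching piece.
my @digits     = $line =~ /\d/g;   # the characters 4 and 2
my @whitespace = $line =~ /\s/g;   # space, tab, space
my @word_chars = $line =~ /\w/g;   # every letter and digit
my @non_word   = $line =~ /\W/g;   # here, the same three whitespace characters

printf "digits=%d whitespace=%d word=%d non-word=%d\n",
    scalar @digits, scalar @whitespace, scalar @word_chars, scalar @non_word;
```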
    

  • Define range/list
    Square brackets define a set of characters; the set matches any one character listed within the brackets, as in the following examples.

         [0-9]           any single digit, 0 through 9
         [a-z]           any single lower case letter a through z
         [0-9a-zA-Z]     any single digit or letter
         [^0-9a-zA-Z]    any single character that is not a digit or letter (a leading ^ inside brackets negates the set)
         [ ]             single space
         [ \t]           single space or tab
         [0-9][a-z]      single digit followed by single letter
         [aeiou]         any vowel
         [^aeiou]        any single character outside the given set
         (foo|bar|baz)   any one of the alternatives listed (uses parentheses and |, not square brackets)
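
The sketch below tries a few of these sets on single-character samples, followed by one alternation. All sample values and names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical one-character samples.
my @tokens = ('a', 'E', '7', ' ', 'z', '_');

my @lower     = grep { /[a-z]/ }        @tokens;   # lower case letters only
my @vowels    = grep { /[aeiou]/ }      @tokens;   # "E" is upper case, so it is not in the set
my @not_alnum = grep { /[^0-9a-zA-Z]/ } @tokens;   # the space and the underscore

# Alternation matches if any one of the listed words appears.
my @alt = grep { /(foo|bar|baz)/ } ('food', 'rebar', 'qux');

print "lower: @lower\n";
print "vowels: @vowels\n";
print "not alphanumeric: ", scalar(@not_alnum), "\n";
print "alternation: @alt\n";
```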
    

  • Define position or adjacent characters
    Regular expressions allow the location of a matching pattern to be defined, as well as placing limits on adjacent characters.
         ^      refers to the start of the line
         $      refers to the end of the line
         \b     refers to a word boundary
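
To show how the three position markers narrow a search, this sketch applies each of them to the same list of strings. The strings are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical strings that all contain the letters "doc".
my @lines = ('document', 'a doc', 'docs here', 'undoc');

my @at_start   = grep { /^doc/ }     @lines;   # "doc" at the very beginning
my @at_end     = grep { /doc$/ }     @lines;   # "doc" at the very end
my @whole_word = grep { /\bdoc\b/ }  @lines;   # "doc" with a word boundary on both sides

print "start: ", join(', ', @at_start), "\n";
print "end: ", join(', ', @at_end), "\n";
print "whole word: ", join(', ', @whole_word), "\n";
```

Only "a doc" passes the last test, because in every other string a word character touches "doc" on at least one side.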
    

  • Define number of occurrences
    Regular expressions provide the following shortcuts for specifying how many times the preceding character or group may occur. Exact quantities and ranges are both supported.
         *               zero or more of the previous thing
         +               one or more of the previous thing
         ?               zero or one of the previous thing
         {3}             matches exactly 3 of the previous thing
         {3,6}           matches between 3 and 6 of the previous thing
         {3,}            matches 3 or more of the previous thing
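
The following sketch runs each of the six quantifiers against strings that differ only in how many times the letter b is repeated. The sample strings are generated for illustration, and ^ and $ are added so that the whole string has to fit the count.

```perl
use strict;
use warnings;

# Hypothetical samples: "a", then 0, 1, 3, or 7 copies of "b", then "c".
my @samples = map { 'a' . ('b' x $_) . 'c' } (0, 1, 3, 7);

my @star     = grep { /^ab*c$/ }     @samples;   # any number of b, including none
my @plus     = grep { /^ab+c$/ }     @samples;   # at least one b
my @optional = grep { /^ab?c$/ }     @samples;   # no b or one b
my @exact    = grep { /^ab{3}c$/ }   @samples;   # exactly three b
my @range    = grep { /^ab{3,6}c$/ } @samples;   # three to six b
my @at_least = grep { /^ab{3,}c$/ }  @samples;   # three or more b

printf "*=%d +=%d ?=%d {3}=%d {3,6}=%d {3,}=%d\n",
    scalar @star, scalar @plus, scalar @optional,
    scalar @exact, scalar @range, scalar @at_least;
```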

Automatic Special Variable Assignment
After each successful match, Perl sets several of its special variables. These automatic assignments make information available to the program with no effort on the part of the programmer.

     @-        offsets of the start of the last match ($-[0]) and of each ()-enclosed group
     @+        offsets of the end of the last match ($+[0]) and of each ()-enclosed group
     $&        string last matched
     $`        string preceding last match
     $'        string following last match
     $+        text matched by last ()-enclosed pattern
     $1, $2..  text matched by each ()-enclosed pattern

Regarding the last entry: when parentheses are used to group portions of a pattern, Perl places the text matched by each group in $1, $2, and so on, numbered in the order of the opening parentheses.

     /(a.)(b$)/   $1 holds match of (a.), $2 holds match of (b$)
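
As a sketch of how these variables can be read after a match, the code below uses a variant of the pattern above without the trailing $, so that some text remains after the match. The input string and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $text = 'xx azb yy';   # hypothetical input

my ($before, $matched, $after, $start, $end, $first, $second);

if ($text =~ /(a.)(b)/) {
    # Copy the values right away; the next successful match overwrites them.
    ($before, $matched, $after) = ($`, $&, $');
    ($start, $end)              = ($-[0], $+[0]);
    ($first, $second)           = ($1, $2);

    printf "matched=%s start=%d end=%d first=%s second=%s\n",
        $matched, $start, $end, $first, $second;
}
```

The end offset is the position just past the last matched character, so the length of the match is the end offset minus the start offset.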

Pattern Examples
Here are several example regular expressions which combine the use of the special character types discussed above. All of these search on the content of the special variable $_. Matches are listed first, followed by substitutions.

     /^$/                 empty line (start and end are adjacent)
     /^a/                 starts with the letter a
     /^\d/                starts with any digit
     /^\d+/               starts with one or more digits
     /a$/                 ends with the letter a
     /[ ]+/               one or more spaces in a sequence
     /^[ ]*$/             line with zero or more spaces and nothing else
     /^[^0-9a-zA-Z \t]/   line beginning with a character other than a digit, letter, space, or tab
     /[0-9]{3,5}/         3 to 5 digits in a sequence
     /\bdoc/              word beginning with doc
     /ment\b/             word ending with ment
     /(\d\s){3}/          3 digits, each followed by whitespace (eg "3 4 5 ")
     /(a.)+/              one or more pairs of a followed by any character (eg "abacadaf")

     s/\n//g              remove newlines
     s/\W//g              remove all non-word characters (anything but letters, digits, _)
     s/<([^>]|\n)*>//g    strip HTML tags from a string, including tags that span lines
     s/<[^>]+>//g         remove all HTML tags
     s/^\s+//             strip leading whitespace from a string
     s/\s+$//             strip trailing whitespace from a string
     s/^.*\///            remove leading path from a filename
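
Several of the substitutions above can be combined to clean up a line of input. This is only a sketch with a made-up input line and file path; it copies each string first so that the original is left unchanged.

```perl
use strict;
use warnings;

# Hypothetical input line with tags and stray whitespace.
my $line = "   <b>Hello</b> world  \n";

(my $clean = $line) =~ s/<[^>]+>//g;   # remove the HTML tags
$clean =~ s/^\s+//;                    # strip leading whitespace
$clean =~ s/\s+$//;                    # strip trailing whitespace, including the newline

# Hypothetical file path.
my $file = '/usr/local/lib/notes.txt';
(my $base = $file) =~ s/^.*\///;       # remove everything up to the last slash

print "$clean\n";
print "$base\n";
```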

If you have any suggestions or corrections, please let me know.