Regular expressions express patterns through special characters that represent various types of wild card characters. These special characters can be grouped into three categories: basic characters, repetition characters, and positioning characters.
The basic special characters are as follows. Note the case of the characters. The lower-case letter tends to refer to a specific type or family of character, while the upper case version of the same refers to its opposite.
\d
This character refers to any one numeric digit. For instance, to look for years from 2000 onward (through 2999), you could code: /2\d\d\d/.
\D
This character refers to any one character that is not a numeric digit. That includes alphabetic characters, but also punctuation and whitespace.
\w
This character refers to any word character. A word character is any character that is valid for use in a variable name except the dollar sign. In other words, 0 through 9, A through Z in either case, and the underscore character ( _ ).
\W
This character refers to any non-word character. In other words, any one character that does not meet the criteria for \w. It is useful for searching for things like spaces and punctuation in a string.
\s
This character refers to any whitespace character. A whitespace character is any character treated as white-space in the code. This includes spaces, tabs, carriage returns, and line feeds.
\S
This character refers to any non-whitespace character. It refers to anything that isn't a space, tab, or line break.
.
The period refers to any single character except a line break. It is a wild card that is the equivalent of using a question mark in UNIX file name patterns.
[...]
Square brackets are used to delimit a list of possible values for a given position in the search string. For instance, if you were only looking for words that began with the letters A, B, or C, you could code: /\b[abc]/gi.
[.-.]
You can also express a range of characters in the brackets by using a dash between the beginning and ending character in the series. Thus, the regular expression /\b[a-e]/gi would look for words that began with the letters A, B, C, D or E.
[^...]
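By putting a caret in the brackets, you say any character in this position except those listed. Thus, /\b[^abc]/gi would look for words that begin with any character except A, B, or C. Be aware that the end of a word is also a word boundary, so this expression also matches the space or punctuation mark that follows a word. Adding \W to the list, as in /\b[^abc\W]/gi, limits the match to the first character of a word. The caret has another meaning outside of the brackets. Don't confuse them.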
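The basic characters and bracket forms described above can be tried out with a short sketch like the following. The sample strings and variable names are assumptions made up for illustration; only the patterns come from the descriptions above.

```javascript
// Hypothetical sample strings, chosen only for illustration.
const order = "Order 66: 2 cups";
const fruit = "apple Banana cherry date";

// \d matches one digit at a time; \W matches each non-word character.
const digits = order.match(/\d/g);            // ["6", "6", "2"]
const nonWord = order.match(/\W/g);           // [" ", ":", " ", " "]

// \s matches each whitespace character; here there are three spaces.
const spaceCount = order.match(/\s/g).length; // 3

// A list and a range inside square brackets, anchored to a word boundary.
const startsAbc = fruit.match(/\b[abc]/gi);   // ["a", "B", "c"]
const startsAtoE = fruit.match(/\b[a-e]/gi);  // ["a", "B", "c", "d"]

// The negated list also matches the space after each word, because the
// end of a word is a boundary too.
const notAbcLoose = fruit.match(/\b[^abc]/gi);     // [" ", " ", " ", "d"]

// Excluding non-word characters as well leaves only word starts.
const notAbcStrict = fruit.match(/\b[^abc\W]/gi);  // ["d"]

console.log(digits, nonWord, spaceCount);
console.log(startsAbc, startsAtoE, notAbcLoose, notAbcStrict);
```

The g flag makes match() return every match rather than just the first, which is why each result is an array of single characters.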
Take note that the backslash is used to mark many special characters. The backslash serves as an escape character. In other places punctuation is used to mark special characters. If you want punctuation to be taken as a literal character in a regular expression, you should always escape it with a backslash. On the other hand, alphanumeric characters should never be escaped unless you want them to be special characters. Thus, to look for end of sentence punctuation, you would have to code the following:
// \B requires that no word character follow directly, so the period in "3.14" is skipped
endOfSent = /[\.\?\!]\B/;
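To see why escaping matters, the following sketch compares an unescaped period with an escaped one. The patterns and test strings are illustrative assumptions and do not come from the text above.

```javascript
// Unescaped, the period is a wild card for any single character.
const unescapedDot = /a.c/;
// Escaped, the period only matches a literal period.
const escapedDot = /a\.c/;

const wildMatchesLetter = unescapedDot.test("abc");   // true
const wildMatchesDot = unescapedDot.test("a.c");      // true
const literalMatchesLetter = escapedDot.test("abc");  // false
const literalMatchesDot = escapedDot.test("a.c");     // true

console.log(wildMatchesLetter, wildMatchesDot, literalMatchesLetter, literalMatchesDot);
```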
So, what would happen if you wanted to search a string for the occurrence of either the name "Paul" or the name "Saul"? This is a simple example. We could code the following:
// regular expression to find Paul or Saul
pandsStr = /\b[ps]aul\b/gi;
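As a quick check of the expression above, the sketch below runs it against a sample sentence. The sentence and the names roster and foundNames are made up for illustration.

```javascript
const pandsStr = /\b[ps]aul\b/gi;

// Hypothetical sentence containing both names plus two near misses.
const roster = "Paul met Saul, but Paula and Raul stayed home.";

// "Paula" fails the closing \b and "Raul" fails the [ps] list.
const foundNames = roster.match(pandsStr);

console.log(foundNames); // ["Paul", "Saul"]
```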
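If we were looking for student IDs, which here begin with two alphabetic characters followed by six numbers, we could code either of the following, as well as many others. The first form is looser than the second, because \D accepts any non-digit, including punctuation and spaces, while [a-z] with the i flag accepts only letters. We will learn shorter ways of doing this next:
stuID = /\b\D\D\d\d\d\d\d\d\b/gi;
stuID = /\b[a-z][a-z]\d\d\d\d\d\d\b/gi;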
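The two student ID expressions do not behave identically, which the following sketch tries to show. The names looseId and strictId and the sample inputs are assumptions for illustration. The g flag is left off here because test() on a global expression remembers its position between calls.

```javascript
const looseId = /\b\D\D\d\d\d\d\d\d\b/i;
const strictId = /\b[a-z][a-z]\d\d\d\d\d\d\b/i;

// A well-formed ID passes both expressions.
const looseGood = looseId.test("AB123456");    // true
const strictGood = strictId.test("AB123456");  // true

// In "ID: 123456" the colon and space are non-digits, so \D\D accepts
// them in place of the two letters. The bracket version does not.
const looseLabel = looseId.test("ID: 123456");   // true
const strictLabel = strictId.test("ID: 123456"); // false

console.log(looseGood, strictGood, looseLabel, strictLabel);
```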
Repetition characters allow us to specify that certain characters are to be repeated a certain number of times. A repetition character always appears immediately after the element it refers to. It refers to a single character position unless you group characters, which we haven't discussed yet.
The repetition characters are as follows:
{n}
The preceding character must repeat exactly n times. Thus you might test a year field for /\d{4}/ to make sure four digits were entered.
{n,}
The preceding character must repeat at least n times, but may repeat more.
{n,m}
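The preceding character must repeat between n and m times. If you wanted to make sure a password was between 6 and 12 characters in length and contained valid word characters, you could code: /\w{6,12}/. On its own, this also matches inside longer strings. The positioning characters ^ and $, covered later, are needed to test the whole string: /^\w{6,12}$/.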
?
The preceding character may occur zero or one times in the string. If it appears zero times, then that position is ignored in the search string and the next character is compared at the same location in the string being searched. For instance, if you had an inventory system where all part numbers were assigned a six-digit number, some of which were preceded by an alphabetic character, you could code: /[a-z]?\d{6}/i.
+
The preceding character must occur at least once in a string, but can occur multiple times.
*
The preceding character can occur zero or more times in the string. Use this character with caution. For instance, /a*/ would match any string, since zero occurrences of the letter A also count as a match.
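The repetition characters listed above can be compared side by side in a sketch like this one. The inputs and variable names are hypothetical, and the part number pattern uses [a-z] with the i flag as one way to accept a single leading letter.

```javascript
const fourDigits = /\d{4}/;
const partNumber = /[a-z]?\d{6}/i;

// {n}: at least one run of four digits must be present.
const yearOk = fourDigits.test("2024");      // true
const yearShort = fourDigits.test("99");     // false
const yearLong = fourDigits.test("123456");  // true: four digits are found inside

// ?: the leading letter may or may not be there.
const withLetter = "A123456".match(partNumber)[0]; // "A123456"
const noLetter = "123456".match(partNumber)[0];    // "123456"

// {n,} and {n,m} take as many repeats as they are allowed.
const twoOrMore = "baaad".match(/a{2,}/)[0];  // "aaa"
const oneOrTwo = "baaad".match(/a{1,2}/)[0];  // "aa"

// + needs at least one "a"; * is satisfied by none at all.
const plusMiss = /a+/.test("xyz");        // false
const starHit = /a*/.test("xyz");         // true
const starText = "xyz".match(/a*/)[0];    // "" (an empty match)

console.log(yearOk, yearShort, yearLong, withLetter, noLetter);
console.log(twoOrMore, oneOrTwo, plusMiss, starHit, JSON.stringify(starText));
```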
A more complicated example of repetition characters can be seen in testing whether a social security number is valid. If it is, then it has nine digits, and there may or may not be a dash between the third and fourth digit and between the fifth and sixth digit. How would we code for that?
Well, the physical structure of the number can be either 123456789 or 123-45-6789.
Let's start with the more complex form.
The social security number with the dashes can be written as follows. Note that the backslashes before the minus signs are optional here, since a dash is only special inside square brackets, but it never hurts to be safe.
// the string 123-45-6789
SSNStr = /\d\d\d\-\d\d\-\d\d\d\d/;
Now, the dashes are optional, so let's mark them as such:
SSNStr = /\d\d\d\-?\d\d\-?\d\d\d\d/;
Then we can use repetition characters to make the code slightly easier to read:
SSNStr = /\d{3}\-?\d{2}\-?\d{4}/;
This will address both conditions because if there are no dashes, those positions in our string will be ignored.
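A short sketch can confirm that the final expression accepts both layouts. The sample numbers other than the two shown above are made up for illustration, and the result names are hypothetical.

```javascript
const SSNStr = /\d{3}\-?\d{2}\-?\d{4}/;

const plainOk = SSNStr.test("123456789");     // true
const dashedOk = SSNStr.test("123-45-6789");  // true

// Each dash is optional on its own, so a single dash is accepted too.
const oneDashOk = SSNStr.test("123-456789");  // true

// Dashes in the wrong places leave no run that fits the pattern.
const misplaced = SSNStr.test("12-345-6789"); // false

console.log(plainOk, dashedOk, oneDashOk, misplaced);
```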
Regular expressions also allow us to specify the position of characters in a string. We have already encountered \b, which is used to delimit a word boundary. A word boundary is where a word begins or ends. The positioning characters are as follows:
\b
Delimits a word boundary. Thus if you only want to look for the word "stop" when it does not occur inside another word, you could code: /\bstop\b/gi.
\B
Indicates a non-boundary. What is a non-boundary? Well, if you wanted to find "Paula" and "Pauline" but not "Paul", you could code: /\bpaul\B/gi. This would match words that begin with "paul", but had additional characters after the L.
^
Indicates the beginning of a string or, when the m flag is set, the beginning of a line in a multi-line string. It appears at the beginning of the regular expression, since there shouldn't be anything before it in the string. Don't confuse it with the negation operator inside the square brackets. To check whether "Paul" was the first word in a string, you could code: /^paul\b/i.
$
Indicates the end of a string or, when the m flag is set, the end of a line in a multi-line string. It appears at the end of the regular expression, since there shouldn't be anything after it in the string. To check whether "Paul" was the last word in a string, you could code: /\bpaul$/i. If you wanted to allow for a trailing punctuation character, you could also code: /\bpaul\W?$/i, where the question mark makes that character optional.
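The positioning characters can be exercised together in a sketch like the following. The sample sentences and variable names are assumptions for illustration. The m flag in the last lines is what lets ^ match at the start of a line inside the string, and \W? is one way to make a trailing punctuation character optional.

```javascript
// \b on both sides: "stops" is rejected because a letter follows "stop".
const stops = "Stop! Do not stop at the bus stops.".match(/\bstop\b/gi);
// ["Stop", "stop"]

// \B after the L: only names that continue past "Paul" match.
const longer = "Paul, Paula and Pauline".match(/\bpaul\B/gi);
// ["Paul", "Paul"] (taken from "Paula" and "Pauline")

// ^ pins the match to the start of the string.
const firstWord = /^paul\b/i.test("Paul went home"); // true
const notFirst = /^paul\b/i.test("Saw Paul");        // false

// $ pins the match to the end of the string.
const lastWord = /\bpaul$/i.test("Ask Paul");          // true
const withPeriod = /\bpaul$/i.test("Ask Paul.");       // false
const allowPunct = /\bpaul\W?$/i.test("Ask Paul.");    // true

// Without the m flag, ^ only matches at the very start of the string.
const secondLinePlain = /^paul\b/i.test("Ann\nPaul");  // false
const secondLineMulti = /^paul\b/im.test("Ann\nPaul"); // true

console.log(stops, longer);
console.log(firstWord, notFirst, lastWord, withPeriod, allowPunct);
console.log(secondLinePlain, secondLineMulti);
```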