Regular Expressions in PHP part 1
Introduction
Let's look at the structure of basic regular expression patterns in PHP. These patterns are used very frequently, either on their own or in combination, to validate simple and complex user input and in similar situations. We will use the preg_match_all function throughout to collect the matched values in a $matches array and see what we get for different regex patterns.
Test string
The image below shows the string variable that I will test with regular expressions in this article.
Ranges
1) Number range
This pattern finds and returns all digit characters that fall within the 0-9 character range. The forward slashes are the delimiters that enclose the whole pattern. The square brackets define a character class, and the - character between 0 and 9 turns the class into a range. Without the hyphen, the class would contain only the individual characters listed, so [09] would match just 0 or 9.
Code
Result
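As a minimal sketch of the number range, assuming a hypothetical test string (the article's own string is not reproduced here) and the pattern /[0-9]/ described above:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// [0-9] matches any single digit; preg_match_all collects every match
// into $matches[0] and returns the number of matches found.
$count = preg_match_all('/[0-9]/', $string, $matches);

print_r($matches[0]); // the four digits 1, 2, 3 and 4
```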
2) Negate number range
This finds and returns all characters that do not fall within the 0-9 character range. The ^ metacharacter at the start of a character class negates it, so the pattern essentially says "give me back everything that doesn't belong to the 0-9 range", whitespace included.
Code
Result
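A short sketch of the negated range, again on an assumed sample string rather than the article's original one:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// [^0-9] matches any single character that is NOT a digit,
// including spaces and punctuation.
$count = preg_match_all('/[^0-9]/', $string, $matches);

// Join the single-character matches to see what was kept.
echo implode('', $matches[0]); // Lorem ipsum !
```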
3) Lowercase letters range
This finds and returns all characters that fall within the a-z character range. The pattern is case sensitive and matches only lowercase letters, so keep this in mind when using it.
Code
Result
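The lowercase range could be tried like this; the sample string is an assumption, picked so that it contains one uppercase letter to show the case sensitivity:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// [a-z] matches lowercase letters only, so the capital "L" is skipped.
$count = preg_match_all('/[a-z]/', $string, $matches);

echo implode('', $matches[0]); // oremipsum
```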
4) Negate lowercase letter range
This returns all characters that do not fall within the lowercase a-z character range. Keep in mind that this also returns uppercase letters, digits, whitespace and symbols, so literally everything that doesn't fit into the a-z range.
Code
Result
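A sketch of the negated lowercase range on the same assumed sample string, showing that the uppercase letter, the spaces, the digits and the exclamation mark all come back:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// [^a-z] matches every character that is not a lowercase letter.
$count = preg_match_all('/[^a-z]/', $string, $matches);

var_dump($matches[0]); // "L", two spaces, the digits 1-4 and "!"
```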
5) Uppercase letter range
Similar to the a-z range pattern, but this one matches only uppercase letters.
Code
Result
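For the uppercase range, a minimal sketch under the same assumption about the test string:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// [A-Z] matches uppercase letters only.
$count = preg_match_all('/[A-Z]/', $string, $matches);

print_r($matches[0]); // only the capital "L"
```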
6) Case insensitive letters range
This returns letters that fall within either the lowercase a-z or the uppercase A-Z character range. The letter i after the closing delimiter is the case-insensitive modifier.
Code
Result
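The effect of the i modifier can be sketched as follows, assuming the pattern /[a-z]/i and an illustrative string:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// The trailing "i" makes [a-z] match uppercase letters as well.
$count = preg_match_all('/[a-z]/i', $string, $matches);

echo implode('', $matches[0]); // Loremipsum
```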
Individual characters
7) Literal string values
This matches every occurrence of the literal a-z sequence, so if the characters a-z appear anywhere in our string, the pattern matches that part of the string. Without square brackets the hyphen has no special meaning, which shows how careful we should be with range and literal patterns: the brackets that wrap a set of characters make a big difference in what the final output of preg_match_all will be.
Code
Result
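To illustrate the literal pattern, here is a sketch on a made-up string that actually contains the three characters a-z (an assumption; the article's test string may not):

```php
<?php
// Hypothetical test string that contains the literal text "a-z".
$string = 'Letters a-z, not just a or z';

// Outside square brackets the hyphen is an ordinary character,
// so /a-z/ matches the three-character sequence "a-z" only.
$count = preg_match_all('/a-z/', $string, $matches);

print_r($matches[0]); // a single match: "a-z"
```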
8) Literal number values
This pattern matches all occurrences of the literal 123 sequence in a string. Again, the square brackets make a big difference in what the final pattern means. If we wrapped 123 in square brackets, giving the pattern [123], it would match every single digit that is either 1, 2 or 3. That gives a totally different result from the 123 sequence pattern, which matches only the digits 1, 2 and 3 in that exact order. Just something to keep in mind.
Code
Result
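The difference between 123 and [123] can be seen side by side in this sketch (the sample string is assumed):

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// Literal sequence: matches "123" as one three-character item.
preg_match_all('/123/', $string, $sequence);

// Character class: matches each digit 1, 2 or 3 on its own.
preg_match_all('/[123]/', $string, $class);

print_r($sequence[0]); // one item: "123"
print_r($class[0]);    // three items: "1", "2", "3"
```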
9) Non word characters
This pattern (the \W metacharacter) matches all individual non-word characters in a string. Note that "non-word" includes whitespace as well as most symbols.
Code
Result
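A minimal sketch of non-word matching, assuming the pattern is /\W/ and using an illustrative string:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// \W matches any character that is not a letter, digit or underscore.
$count = preg_match_all('/\W/', $string, $matches);

var_dump($matches[0]); // two spaces and the exclamation mark
```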
10) Word characters
This pattern (the \w metacharacter) matches all individual word characters in a string: letters, digits and the underscore. Most other symbols are excluded.
Code
Result
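To show that the underscore counts as a word character, this sketch assumes the pattern /\w/ and a made-up string containing one:

```php
<?php
// Hypothetical test string with an underscore, a space and a symbol.
$string = 'snake_case 42!';

// \w matches letters, digits and the underscore.
$count = preg_match_all('/\w/', $string, $matches);

echo implode('', $matches[0]); // snake_case42
```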
11) Whitespaces
This pattern (the \s metacharacter) matches all individual whitespace characters in a string, which covers spaces as well as tabs and line breaks.
Code
Result
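A sketch of whitespace matching, assuming the pattern /\s/ and a made-up string that mixes a tab, a space and a newline:

```php
<?php
// Hypothetical test string; double quotes so \t and \n are real
// tab and newline characters.
$string = "Lorem\tipsum 1234!\n";

// \s matches any whitespace character, not just the space.
$count = preg_match_all('/\s/', $string, $matches);

var_dump($matches[0]); // tab, space, newline
```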
12) Non whitespace characters
This pattern (the \S metacharacter) matches all individual non-whitespace characters in a string.
Code
Result
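The opposite case, sketched with the assumed pattern /\S/ on an illustrative string:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// \S matches every character that is not whitespace.
$count = preg_match_all('/\S/', $string, $matches);

echo implode('', $matches[0]); // Loremipsum1234!
```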
Different combinations, quantifiers and groups
13) Using the word boundary metacharacter
This pattern matches the characters ipsum only when they form a whole word (so the word ipsum). The \b part of the pattern is the word boundary metacharacter, which tells the function that the whole word has to match. Without it, the pattern would match ipsum anywhere that sequence of characters appears, even inside a longer word. The i modifier makes the pattern case insensitive.
Code
Result
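The effect of \b is easiest to see by running the pattern with and without it. The string below is an assumption, built to contain ipsum both as a whole word and inside a longer word:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum, Ipsum and ipsums';

// With word boundaries: only the standalone words match.
preg_match_all('/\bipsum\b/i', $string, $whole);

// Without word boundaries: the "ipsum" inside "ipsums" matches too.
preg_match_all('/ipsum/i', $string, $any);

print_r($whole[0]); // "ipsum", "Ipsum"
print_r($any[0]);   // "ipsum", "Ipsum", "ipsum"
```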
14) Using the OR metacharacter
This pattern matches every character that is either a non-word character, defined by \W, or a digit, defined by the 0-9 range. The OR metacharacter is the pipe character (|) that separates the two alternatives of the pattern.
Code
Result
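A sketch of the alternation, assuming the pattern is written as /\W|[0-9]/ and using an illustrative string:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// Each character matches if it is a non-word character OR a digit.
$count = preg_match_all('/\W|[0-9]/', $string, $matches);

var_dump($matches[0]); // two spaces, the digits 1-4, then "!"
```

Matches come back in the order they appear in the string, not grouped by alternative.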
15) Quantifiers
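This matches lowercase letters in pairs: because of the {2} quantifier at the end of the pattern, each match is exactly two consecutive lowercase letters, so the result is chunked into separate 2-letter items. A letter left over at the end of an odd-length run is not matched.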
Code
Result
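The chunking can be sketched as below, assuming the pattern /[a-z]{2}/ and a sample string whose second word has an odd number of letters:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// [a-z]{2} matches exactly two consecutive lowercase letters at a time.
$count = preg_match_all('/[a-z]{2}/', $string, $matches);

print_r($matches[0]); // "or", "em", "ip", "su"
```

The final m of ipsum has no lowercase neighbour left to pair with, so it is dropped, as is the capital L.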
16) Combine character range, quantifiers and word boundary metacharacters
The previous pattern matches lowercase letters anywhere and returns them in 2-character chunks. This pattern, by contrast, matches only whole words that consist of exactly 2 letters in the a-z range. The \b metacharacter, which represents a word boundary, at the start and at the end of the pattern makes all the difference. So in this case preg_match_all finds nothing and $matches[0] is an empty array, because our string doesn't contain a single word that is 2 characters long.
Code
Result
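A sketch of the combined pattern, assuming it is /\b[a-z]{2}\b/. Both strings are made up: the first has no 2-letter words, the second consists only of them:

```php
<?php
// Hypothetical test strings, chosen only for illustration.
$noShortWords = 'Lorem ipsum 1234!';
$shortWords   = 'it is so';

// Whole words made of exactly two lowercase letters.
$pattern = '/\b[a-z]{2}\b/';

$countNone = preg_match_all($pattern, $noShortWords, $none);
$countSome = preg_match_all($pattern, $shortWords, $some);

var_dump($countNone); // int(0), and $none[0] is an empty array
print_r($some[0]);    // "it", "is", "so"
```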
17) Grouping characters
This pattern matches the characters 1234, additionally split by the group metacharacters (the parentheses) into 2 separate groups, which preg_match_all returns as such. The first position, $matches[0], holds the full match, and the remaining two positions hold the captured groups as defined by the pattern.
Code
Result
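The exact grouping used in the original example is not shown in the text, so this sketch assumes the pattern /(12)(34)/ and an illustrative string:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// Two capturing groups: (12) and (34). The split is an assumption.
$count = preg_match_all('/(12)(34)/', $string, $matches);

print_r($matches);
// $matches[0] holds the full match "1234",
// $matches[1] holds group 1 ("12"), $matches[2] holds group 2 ("34").
```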
18) Combine grouping and OR metacharacters
This pattern uses the pipe character (the OR metacharacter) to divide a pattern into two groups. It separates the matched values into a \W group and a [0-9] group. The first group contains all non-word characters (the \W part), and the second group contains all digits (the [0-9] part). Keep in mind that every $matches sub-array has the same number of items as $matches[0]: where a group did not take part in a match, its slot holds an empty string, which can look confusing at first.
Code
Result
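The padded group arrays are easier to understand with a concrete run. This sketch assumes the pattern /(\W)|([0-9])/, an illustrative string, and the default PREG_PATTERN_ORDER behaviour of preg_match_all:

```php
<?php
// Hypothetical test string, chosen only for illustration.
$string = 'Lorem ipsum 1234!';

// Group 1 captures non-word characters, group 2 captures digits.
$count = preg_match_all('/(\W)|([0-9])/', $string, $matches);

var_dump($matches[0]); // every match, in order of appearance
var_dump($matches[1]); // non-word characters, '' where a digit matched
var_dump($matches[2]); // digits, '' where a non-word character matched
```

All three sub-arrays have seven entries, one per match. Filtering out the empty strings, for example with array_filter and a strlen-based callback, leaves only the values each group actually captured.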
Our next article about Regular Expressions in PHP will demonstrate some real life examples of complex RegEx patterns. Happy coding.