Regular Expressions in PHP part 1

Introduction

Let's analyze the structure of basic regular expression patterns in PHP. These patterns are used very frequently, either on their own or in combination, to validate simple and complex user input and in similar situations. In the examples we will use the preg_match_all function, which stores the matched values in a $matches array, to see what we get for different regex patterns.

Test string

The image below shows the string variable that I will test with regular expressions in this article.

test string

Ranges

1) Number range

This pattern will find and return every digit that falls within the 0-9 character range. The forward slashes are the delimiters that enclose the whole pattern. The square brackets define a character class, and the hyphen between 0 and 9 turns those two characters into a range. Without the hyphen, the class would match only the individual characters listed inside the brackets, so [09] would match just 0 or 9.

Code

0-9-regex

Result

0-9-regex-scrn
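
The screenshots are not reproduced here, so the following is only a minimal sketch of the idea. It assumes the pattern is /[0-9]/, and $string is a hypothetical stand-in for the article's test string.

```php
<?php
// Hypothetical test string; the article's own string is not reproduced here.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: a character class covering the digits 0 through 9.
preg_match_all('/[0-9]/', $string, $matches);

// $matches[0] holds every full match, one digit per item.
print_r($matches[0]);
```

With this string, each of the four digits comes back as its own array item.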

2) Negate number range

This will find and return every character that does not fall within the 0-9 range. When ^ is the first character inside the square brackets it negates the class, so the pattern essentially says "give me everything that doesn't belong to the 0-9 range", whitespace included.

Code

not-0-9-regex

Result

not-0-9-regex-scrn
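
A short sketch of the negated class, assuming the pattern is /[^0-9]/ and using a hypothetical $string in place of the article's test string:

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: ^ as the first character inside the brackets negates the class.
preg_match_all('/[^0-9]/', $string, $matches);

// Join the single-character matches to see what survived.
$nonDigits = implode('', $matches[0]);
var_dump($nonDigits);
```

Everything except the digits is returned, including the spaces on both sides of 1234.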

3) Lowercase letters range

This will find and return all characters that fall within the a-z range. The pattern is case sensitive and matches only lowercase letters, so keep that in mind when using it.

Code

a-z-regex

Result

a-z-regex-scrn
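
As a rough illustration, assuming the pattern is /[a-z]/ and that the hypothetical $string below stands in for the article's test string:

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: lowercase letters only; 'L' and 'I' are skipped.
preg_match_all('/[a-z]/', $string, $matches);

$lowercase = implode('', $matches[0]);
var_dump($lowercase);
```

The uppercase L and I are missing from the result, which is the case sensitivity described above.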

4) Negate lowercase letter range

This will return all characters that do not fall within the lowercase a-z range. Keep in mind that this also returns whitespace, digits, symbols and uppercase letters, so literally everything outside the a-z range.

Code

not-a-z-regex

Result

not-a-z-regex-scrn
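
A possible sketch, assuming the pattern is /[^a-z]/; $string is again a hypothetical stand-in for the article's test string:

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: anything that is not a lowercase letter.
preg_match_all('/[^a-z]/', $string, $matches);

$rest = implode('', $matches[0]);
var_dump($rest);
```

Note that the uppercase letters are part of the result, together with the spaces, digits, hyphen and exclamation mark.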

5) Uppercase letter range

This is similar to the a-z range pattern, but it matches only uppercase letters.

Code

A-Z-regex-uppercase

Result

A-Z-regex-uppercase-scrn
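
A minimal sketch under the assumption that the pattern is /[A-Z]/, with a hypothetical $string replacing the article's test string:

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: uppercase letters only.
preg_match_all('/[A-Z]/', $string, $matches);

print_r($matches[0]);
```

Only the two capital letters that start each word are returned.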

6) Case insensitive letters range

This will return letters that fall within either the lowercase a-z or the uppercase A-Z range. The letter i after the closing delimiter is the case-insensitive modifier.

Code

a-z-regex-cinsensitive

Result

a-z-regex-cinsensitive-scrn
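
To illustrate the modifier, here is a small sketch that assumes the pattern is /[a-z]/i and uses a hypothetical $string:

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: the i modifier after the closing delimiter
// makes the a-z range match uppercase letters as well.
preg_match_all('/[a-z]/i', $string, $matches);

$letters = implode('', $matches[0]);
var_dump($letters);
```

Writing the class as [a-zA-Z] without the modifier would return the same letters for this string.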

Individual characters

7) Literal string values

This will match every occurrence of the literal a-z sequence, so if the characters a-z appear anywhere in our string, that part of the string is matched. Without the square brackets the hyphen has no special meaning. This shows how careful we should be with range and literal patterns: the brackets around a set of characters make a big difference in what preg_match_all finally returns.

Code

a-z-single-regex

Result

a-z-single-regex-scrn
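
A sketch of the difference, assuming the pattern is /a-z/ without brackets; the $string below is hypothetical and happens to contain the literal text a-z:

```php
<?php
// Hypothetical test string that contains the literal sequence "a-z".
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: outside square brackets the hyphen is an ordinary character.
preg_match_all('/a-z/', $string, $matches);

print_r($matches[0]);
```

The result is a single three-character item rather than a list of individual letters.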

8) Literal number values

This pattern will match all occurrences of the literal 123 sequence in a string. Again, the square brackets make a big difference. If we wrapped 123 in square brackets, giving [123], the pattern would match every single digit that is 1, 2 or 3. That gives a totally different result from the 123 sequence pattern, which matches only the digits 1, 2 and 3 in that exact order. Just something to keep in mind.

Code

123-regex

Result

123-regex-scrn
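
The contrast between the two forms can be sketched like this. Both patterns come from the text above; the $string, $sequence and $class names are hypothetical.

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Literal sequence: matches "123" only when the digits appear in that order.
preg_match_all('/123/', $string, $sequence);

// Character class: matches any single character that is 1, 2 or 3.
preg_match_all('/[123]/', $string, $class);

print_r($sequence[0]);
print_r($class[0]);
```

The first call returns one item containing the whole sequence, while the second returns three separate one-digit items.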

9) Non word characters

This pattern will match every individual non-word character in a string. Note that "non-word" includes whitespace as well as punctuation and symbols other than the underscore.

Code

W-regex

Result

W-regex-screen
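
A brief sketch, assuming the pattern is /\W/ and using a hypothetical $string:

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: \W matches any character that is not a letter, digit or underscore.
preg_match_all('/\W/', $string, $matches);

print_r($matches[0]);
```

For this string the result consists of the three spaces, the hyphen and the exclamation mark.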

10) Word characters

This pattern will match every individual word character in a string: letters, digits and the underscore. All other symbols are excluded.

Code

small-w-regex

Result

small-w-regex-screen
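
As a sketch, assuming the pattern is /\w/; both input strings are hypothetical, and the second one is there only to show the underscore:

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: \w matches letters, digits and the underscore.
preg_match_all('/\w/', $string, $matches);
$wordChars = implode('', $matches[0]);

// The underscore counts as a word character too.
preg_match_all('/\w/', 'a_b', $underscore);

var_dump($wordChars);
print_r($underscore[0]);
```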

11) Whitespaces

This pattern will match every individual whitespace character in a string.

Code

space regex

Result

space regex
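
The pattern itself is not visible here, so this sketch assumes it is /\s/, which matches tabs and newlines as well as plain spaces. Both input strings are hypothetical.

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: \s matches any whitespace character.
$count = preg_match_all('/\s/', $string, $matches);

// Double quotes so that \t and \n become a real tab and newline.
preg_match_all('/\s/', "a b\tc\nd", $mixed);

var_dump($count);
var_dump($mixed[0]);
```

The return value of preg_match_all is the number of full matches, which is a quick way to count whitespace characters.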

12) Non whitespace characters

This will match every individual non-whitespace character in a string.

Code

non space regex

Result

non space regex
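
A small sketch, assuming the pattern is /\S/ (the uppercase counterpart of \s) and using a hypothetical $string:

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: \S matches any character that is not whitespace.
preg_match_all('/\S/', $string, $matches);

$visible = implode('', $matches[0]);
var_dump($visible);
```

Joined together, the matches read like the original string with the three spaces removed.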

Different combinations, quantifiers and groups

13) Using the word boundary metacharacter

This pattern will match the ipsum characters only if they form a whole word (so the word ipsum). The \b part of the pattern is the word boundary metacharacter, which tells the function that the whole word has to be matched. Without it, the pattern would match ipsum in any string that contains that sequence of characters, regardless of the characters around it. The i designates that the pattern is case insensitive.

Code

word boundary regex

Result

word boundary regex
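
To make the effect of \b visible, this sketch assumes the pattern is /\bipsum\b/i and compares it with the same pattern without boundaries. The $string is hypothetical and was chosen to contain ipsum both as a whole word and inside a longer word.

```php
<?php
// Hypothetical string: "Ipsum" as a whole word, "ipsum" inside "loremipsum".
$string = 'Lorem Ipsum and loremipsum';

// Assumed pattern: word boundaries on both sides, case insensitive.
preg_match_all('/\bipsum\b/i', $string, $whole);

// Same pattern without the boundaries, for comparison.
preg_match_all('/ipsum/i', $string, $anywhere);

print_r($whole[0]);
print_r($anywhere[0]);
```

Only the standalone word is returned by the first call, while the second also picks up the ipsum at the end of loremipsum.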

14) Using the OR metacharacter

This pattern will match all non-word characters, defined by the \W condition, or all digits, defined by the 0-9 range, in a string. The OR metacharacter is the pipe character, which separates the two alternatives of the pattern.

Code

pipe regex

Result

pipe regex
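
A sketch of the alternation, assuming the pattern is /\W|[0-9]/ and using a hypothetical $string:

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: either a non-word character or a digit.
preg_match_all('/\W|[0-9]/', $string, $matches);

$joined = implode('', $matches[0]);
var_dump($joined);
```

The matches come back in the order they occur in the string, not grouped by which alternative matched.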

15) Quantifiers

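This will match lowercase letters, but only in runs of exactly 2 consecutive letters, so the result is chunked into separate 2-letter items because of the {2} quantifier at the end of the pattern. A single letter left over at the end of an odd-length run is not matched.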

Code

quantifier regex

Result

quantifier regex
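
A short sketch of the chunking, assuming the pattern is /[a-z]{2}/. Both input strings are hypothetical; the second one has an odd number of letters to show what happens to the leftover character.

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: exactly two consecutive lowercase letters per match.
preg_match_all('/[a-z]{2}/', $string, $matches);

// Five letters: two full pairs, and the trailing "e" is not matched.
preg_match_all('/[a-z]{2}/', 'abcde', $odd);

print_r($matches[0]);
print_r($odd[0]);
```

The lone a and z in the first string are skipped because neither is followed by a second lowercase letter.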

16) Combine character range, quantifiers and word boundary metacharacters

Unlike the previous pattern, which matches all lowercase letters and returns them in 2-character chunks, this pattern matches only whole words made of exactly 2 lowercase a-z letters. The \b word boundary at the start and at the end of the pattern makes all the difference. In this case preg_match_all finds no matches and $matches[0] is an empty array, because our string doesn't contain a single word that is 2 characters long.

Code

quantifier boundary regex

Result

quantifier boundary regex
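
The following sketch assumes the pattern is /\b[a-z]{2}\b/. The first string is a hypothetical stand-in with no 2-letter words, and the second is a hypothetical string that does contain some.

```php
<?php
// Hypothetical test string without any 2-letter word.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: a whole word made of exactly two lowercase letters.
$count = preg_match_all('/\b[a-z]{2}\b/', $string, $matches);

// A second hypothetical string that does contain 2-letter words.
preg_match_all('/\b[a-z]{2}\b/', 'it is an apple', $words);

var_dump($count);
print_r($matches);
print_r($words[0]);
```

When nothing matches, preg_match_all returns 0 and $matches still contains one empty array at index 0.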

17) Grouping characters

This pattern is going to match the 1234 characters, additionally split by the grouping metacharacters (the parentheses) into 2 separate groups, which preg_match_all returns as such. The first item, at the $matches[0] position, holds the whole match, and the remaining two positions hold the character groups as defined by the pattern.

Code

group 1234 regex

Result

group 1234 regex
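
The exact grouping is not visible here, so this sketch assumes the pattern is /(12)(34)/, which splits the sequence into two groups of two digits. The $string is hypothetical.

```php
<?php
// Hypothetical test string, not the one from the article.
$string = 'Lorem Ipsum 1234 a-z!';

// Assumed pattern: two capturing groups that together match "1234".
preg_match_all('/(12)(34)/', $string, $matches);

// Index 0 holds the whole match, indexes 1 and 2 hold the groups.
print_r($matches);
```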

18) Combine grouping and or metacharacters

This pattern combines the pipe character (the OR metacharacter) with grouping to divide a pattern into two groups. It separates the matched values into a \W group and a [0-9] group. The first group will contain all non-word characters (the \W part), and the second group will contain all digits (the [0-9] part). Keep in mind that every item of the $matches array has the same length as the first item: a group that did not take part in a given match gets an empty string at that position, which can look confusing at times.

Code

pipe group regex

Result

pipe group regex
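
To show the equal-length behavior, this sketch assumes the pattern is /(\W)|([0-9])/ and relies on the default PREG_PATTERN_ORDER flag. The $string is hypothetical and deliberately short so that the empty placeholders are easy to spot.

```php
<?php
// Short hypothetical string: one space, one digit, one symbol.
$string = 'a 1!';

// Assumed pattern: group 1 captures non-word characters, group 2 captures digits.
preg_match_all('/(\W)|([0-9])/', $string, $matches);

// $matches[0]: every full match, in order of appearance.
// $matches[1]: group 1, with '' where the digit alternative matched.
// $matches[2]: group 2, with '' where the \W alternative matched.
print_r($matches);
```

If the empty strings get in the way, passing each group through array_filter with 'strlen' as the callback is one way to drop them.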

Our next article about Regular Expressions in PHP will demonstrate some real life examples of complex RegEx patterns. Happy coding.

