The Delicious Truth about Regular Expressions (regex)

June 24, 2025
PHP

Wait! Don’t run away! I know the very thought of a regex terrifies some of you (it made me cringe just a few months ago), but when I sat down and figured them out I realized how amazing they are and… well… I reacted accordingly.

tl;dr Regexes are amazing because ‘/<[^<]+?>/’. Enough said.

===========================================

Let’s start at the beginning, if you’re still reading:

What are Regexes?

A regular expression is a pattern. Yes, a pattern, like the one you can see in a string such as ‘111234555678…’, except that instead of spelling out one fixed piece of text it can describe a whole class of text (for example ‘any digit’) and match anything that fits.

Example

Not only can you match ‘111’ or ‘555’ in that previous string, but you can match any run of three identical digits using the pattern: ‘/(\d)\1{2}/’

What that pattern does is match (using a capture group) a single digit and then check for two more repetitions of that same digit before considering it a match. Let’s take it apart to marvel at it, shall we?

Parentheses in regexes create capture groups. A capture group is essentially a way of saving part of a match so it can be used later in the pattern or in a replacement string.
How would you group without capturing? Use ‘?:’ immediately after you open the parentheses to tell the interpreter it is NOT a capture group.
‘\d’ is the pattern match for a single digit.
‘\1’ is a backreference: it matches whatever text the first capture group matched.
‘{2}’ after ‘\1’ is a quantifier specifying EXACTLY two repetitions.
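
To see those pieces working together, here is a minimal sketch using Python's built-in re module. Python is my choice for illustration only, and it writes the pattern without the surrounding ‘/’ delimiters. The names find_triples, TRIPLE and NON_CAPTURING are made up for this example.

```python
import re

# (\d) captures one digit; \1{2} requires that same digit two more times.
TRIPLE = re.compile(r"(\d)\1{2}")

def find_triples(text):
    """Return every run of three identical digits found in text."""
    return [m.group(0) for m in TRIPLE.finditer(text)]

# (?:...) groups without capturing, so group 1 is still the digit.
NON_CAPTURING = re.compile(r"(?:ab)+(\d)")

print(find_triples("111234555678"))               # ['111', '555']
print(NON_CAPTURING.search("ababab7").group(1))   # 7
```

Note that ‘234’ and ‘678’ are three digits in a row but are not matched, because the backreference insists on the same digit each time.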

Isn’t that wonderful? Yes, yes it is. ONTO MORE DETAILED EXPLANATION!

Enough bragging, how do I use them?

First off we need to go through how a pattern is constructed, so have a little patience.
Note the below terminology is not something I was taught, just how I learned and see it. Hopefully it makes sense to you. Also, my guide may be incomplete.
For reference, I learned most of what I know by tinkering with http://gskinner.com/RegExr/. Thank you, kind sir who built that tool.

Every pattern has a token, a quantifier, and a group regardless of whether they’re explicitly stated or not.

Token

A token is a way of matching a character. The backslash-letter tokens follow a convention worth mentioning: the lowercase letter is the token itself, whereas the uppercase letter is its negation. Here are some tokens, and some examples will follow once all components are explained:

\w == a ‘word’ character (a letter, a digit, or an underscore, not just letters)

\d == a digit character

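\s == a whitespace character (space, tab, newline, and so on)

\b == word boundary (matches the position between a word character and a non-word character rather than a character itself, which makes it easy to trip over)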

\W == non-word

\D == non-digit

\S == non-space

\B == non-boundary

\1 == the first capture group’s result

. == any character (except a newline, by default)

^ == the beginning of the string

$ == the end of the string
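
Here is a quick sketch of a few of these tokens in action, again using Python's re module as an illustration. The sample strings and variable names are invented for the example.

```python
import re

sample = "Order 66 shipped_on 2025-06-24"

words = re.findall(r"\w+", sample)        # runs of letters, digits, underscores
digits = re.findall(r"\d+", sample)       # runs of digits
non_digits = re.findall(r"\D+", "a1b2")   # uppercase \D negates \d
dots = re.findall(r"c.t", "cat cut c\nt") # '.' skips the newline by default
starts = bool(re.search(r"^Order", sample))  # ^ anchors at the start
ends = bool(re.search(r"24$", sample))       # $ anchors at the end
# \b matches a position, so only the standalone word qualifies.
whole_word = re.findall(r"\bcat\b", "cat concat cats")

print(words)       # ['Order', '66', 'shipped_on', '2025', '06', '24']
print(digits)      # ['66', '2025', '06', '24']
print(non_digits)  # ['a', 'b']
print(dots)        # ['cat', 'cut']
print(whole_word)  # ['cat']
```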

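Characters that must be escaped with a ‘\’ to match them literally (separated by spaces): \ . + * ? ^ $ [ ] ( ) | { }. Depending on how you write the pattern you may also need to escape ‘/’ (when it is the pattern delimiter), a quote character (when it is the one wrapping your string), and ‘#’ (in free-spacing mode).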
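
A small sketch of why escaping matters, using Python's re module for illustration. The price string and the names loose and strict are invented here; re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

price = "3.50"

# Unescaped, '.' means "any character", so this also accepts "3x50".
loose = re.compile(price)

# re.escape backslash-escapes the special characters, giving 3\.50
strict = re.compile(re.escape(price))

print(bool(loose.fullmatch("3x50")))   # True
print(bool(strict.fullmatch("3x50")))  # False
print(bool(strict.fullmatch("3.50")))  # True
```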

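To get a literal ‘$’ in a replacement string (not in the pattern you match with), some engines, JavaScript’s for example, use ‘$$’. Inside the pattern itself, escape it as ‘\$’.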

Quantifiers

A quantifier is a way of specifying how many of the previous token to match.

? == matches 0 or 1 of the previous token

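+ == matches 1 or more of the previous token (greedy, grabs as many characters as possible)

* == matches 0 or more of the previous token (greedy, grabs as many characters as possible)

+? == matches 1 or more of the previous token (lazy, grabs as few characters as possible)

*? == matches 0 or more of the previous token (lazy, grabs as few characters as possible)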
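
The difference between greedy and lazy is easiest to see side by side. This is a minimal sketch in Python's re module, used for illustration; the sample HTML string and variable names are made up.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the end, then backs up to the LAST '>'.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the FIRST '>' that completes a match.
lazy = re.findall(r"<.+?>", html)

# The lazy forms keep their minimums: +? needs one, *? accepts none.
one_or_more = re.match(r"a+?", "aaa").group(0)
zero_or_more = re.match(r"a*?", "aaa").group(0)

print(greedy)              # ['<b>bold</b> and <i>italic</i>']
print(lazy)                # ['<b>', '</b>', '<i>', '</i>']
print(repr(one_or_more))   # 'a'
print(repr(zero_or_more))  # ''
```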
