

The Delicious Truth about Regular Expressions (regex)
Wait! Don’t run away! I know the very thought of a regex terrifies some of you (it made me cringe just a few months ago), but when I sat down and figured them out I realized how amazing they are and… well… I reacted accordingly:
tl;dr Regexes are amazing because ‘/<[^<]+?>/’. Enough said.
===========================================
Let’s start at the beginning, if you’re still reading:
What are Regexes?
A regular expression is a pattern. Yes, a pattern, like the one you can see in a string such as ‘111234555678…’, except that instead of only spelling out fixed characters, the pattern can contain variable parts that describe what should match.
Example
Not only can you match ‘111’ or ‘555’ literally in that previous string, but you can match any run of three identical digits using the pattern: ‘/(\d)\1{2}/’
What that pattern does is match (using a capture group) a single digit and then check for two more repetitions of that same digit before considering it a match. Let’s take it apart to marvel at it, shall we?
Parentheses in regexes create capture groups, which are essentially a way of saving part of a match so it can be reused later in the pattern or in a replacement string.
How would you group without capturing? Use ‘?:’ immediately after you open the parentheses to tell the interpreter it is NOT a capture group.
‘\d’ is the pattern match for a single digit.
‘\1’ is a pattern match for the result of the first capture group.
‘{2}’ after ‘\1’ is a quantifier specifying EXACTLY two matches.
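To see those pieces working together, here is a minimal sketch in Python. The choice of Python is my assumption (the pattern itself is not tied to any one language), and the variable names are made up for illustration. Note that Python does not use the surrounding ‘/’ delimiters.

```python
import re

text = "111234555678"

# (\d) captures one digit; \1{2} demands that same digit exactly two more times.
triple = re.compile(r"(\d)\1{2}")
runs = [m.group(0) for m in triple.finditer(text)]      # the full matches
captured = [m.group(1) for m in triple.finditer(text)]  # what group 1 held each time

# (?:\d) groups without capturing, so there is nothing for \1 to refer to.
# This pattern simply matches any three digits in a row.
any_three = re.compile(r"(?:\d){3}")
chunks = any_three.findall(text)

print(runs)              # ['111', '555']
print(captured)          # ['1', '5']
print(chunks)            # ['111', '234', '555', '678']
print(any_three.groups)  # 0 capture groups
```

The contrast between the two results is the point: the backreference only accepts a repeated digit, while the non-capturing version accepts any three digits.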
Isn’t that wonderful? Yes, yes it is. ONTO MORE DETAILED EXPLANATION!
Enough bragging, how do I use them?
First off we need to go through how a pattern is constructed, so have a little patience.
Note the below terminology is not something I was taught, just how I learned and see it. Hopefully it makes sense to you. Also, my guide may be incomplete.
For reference, I learned most of what I know by tinkering with http://gskinner.com/RegExr/. Thank you, kind sir who built that tool.
Every pattern has a token, a quantifier, and a group regardless of whether they’re explicitly stated or not.
Token
A token is a way of matching a character. These tokens follow a logical convention that I feel I should mention: a lowercase letter is the token itself, whereas the uppercase version of that letter is its negation. Here are some tokens; examples will follow once all the components are explained:
\w == a ‘word’ character (letters, digits, and the underscore, not just letters)
\d == a digit character
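\s == a whitespace character (space, tab, newline, and so on)
\b == word boundary, meaning the position between a word character and a non-word character (doesn’t work all that well)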
\W == non-word
\D == non-digit
\S == non-space
\B == non-boundary
\1 == the first capture group’s result
. == any character (by default, any character except a newline)
^ == the beginning of the string
$ == the end of the string
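Characters that must be escaped with a ‘\’ to match them literally (separated by spaces): \ . + * ? ^ $ [ ] ( ) | { } / ‘ #
Some of these, such as ‘/’, only need escaping when they also serve as the pattern’s delimiter.
To get a literal ‘$’ in a replacement string (as opposed to the pattern being matched), use ‘$$’. This applies to flavors such as JavaScript, where ‘$’ is special in replacements.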
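Here is a rough sketch of a few of those tokens in action, again assuming Python (my choice, not something this guide depends on). The sample string and variable names are invented for the example.

```python
import re

sample = "Order #42 costs $3.50 today"

digits = re.findall(r"\d", sample)       # every single digit
words = re.findall(r"\w+", sample)       # runs of word characters
non_space = re.findall(r"\S+", sample)   # runs of anything that is not whitespace

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_order = re.search(r"^Order", sample) is not None
ends_with_order = re.search(r"Order$", sample) is not None

# \b only lets 'cat' match when it stands alone as a word.
whole_word = re.findall(r"\bcat\b", "cat concat cats")

# An unescaped '.' matches any character; '\.' matches only a literal dot.
unescaped = re.findall(r"3.5", "3.5 3x5")
escaped = re.findall(r"3\.5", "3.5 3x5")

# '$' must be escaped too, otherwise it means "end of the string".
price = re.search(r"\$\d+\.\d+", sample).group(0)

print(digits)      # ['4', '2', '3', '5', '0']
print(words)       # ['Order', '42', 'costs', '3', '50', 'today']
print(non_space)   # ['Order', '#42', 'costs', '$3.50', 'today']
print(whole_word)  # ['cat']
print(unescaped)   # ['3.5', '3x5']
print(escaped)     # ['3.5']
print(price)       # $3.50
```

Notice how ‘\w+’ splits ‘3.50’ into ‘3’ and ‘50’: the dot is not a word character, so it ends the run.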
Quantifiers
A quantifier is a way of specifying how many of the previously expressed token to match.
? == matches 0 or 1 of the previous token
+ == matches 1 or more of the previous token (greedy, grabs most characters)
* == matches 0 or more of the previous token (greedy, grabs most characters)
+? == matches 1 or more of the previous token (lazy, grabs fewest characters)
*? == matches 0 or more of the previous token (lazy, grabs fewest characters)
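The greedy versus lazy difference is easiest to see side by side. This is a small sketch assuming Python, with a made-up HTML snippet as input; it also tries the pattern from the tl;dr at the top.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: '.+' grabs as much as it can, so one match swallows the whole string.
greedy = re.findall(r"<.+>", html)

# Lazy: '.+?' stops at the first '>' it can, so each tag is matched on its own.
lazy = re.findall(r"<.+?>", html)

# The tl;dr pattern: '<', then one or more non-'<' characters (lazily), then '>'.
tags = re.findall(r"<[^<]+?>", html)
stripped = re.sub(r"<[^<]+?>", "", html)

# '?' makes the previous token optional.
optional = re.findall(r"colou?r", "color colour")

# Lazy quantifiers take the minimum they are allowed: 0 for '*?', 1 for '+?'.
star_lazy = re.match(r"a*?", "aaa").group(0)
plus_lazy = re.match(r"a+?", "aaa").group(0)

print(greedy)     # ['<b>bold</b> and <i>italic</i>']
print(lazy)       # ['<b>', '</b>', '<i>', '</i>']
print(tags)       # ['<b>', '</b>', '<i>', '</i>']
print(stripped)   # bold and italic
print(optional)   # ['color', 'colour']
print(repr(star_lazy), repr(plus_lazy))  # '' 'a'
```

The last line shows why ‘+?’ and ‘*?’ are not interchangeable: the lazy star is happy with an empty match, while the lazy plus still needs at least one character.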