Rule-based classifications (Part 4: Using regular expressions)
This is Part 4 in a multi-part series describing the new classifications rule builder in Adobe Analytics. If you missed any of the previous installments, they can be found here:
Advanced pattern matching in classification rules
In my last post I mentioned that the classifications rule builder offers four types of matching conditions for classification rules:
- Starts With
- Ends With
- Contains
- Regular Expression
The first three types are easy to use but can be limiting. For example, suppose I am using the rule builder to classify tracking codes of this form:
em:MaybellineSale:June
Let’s assume the first section of the string will be used to set a Channel classification, the second section will be used to set a Campaign Name classification, and the third section will be used to set a Campaign Date classification. It’s pretty easy to see how you could use Starts With to handle the first section of the string:
If Starts With ‘em:’ then set Channel to ‘Email’
You probably have relatively few channels, so creating individual “Starts With” rules to handle each channel is no problem. But handling the second and third sections of the tracking code is tricky:
- You will likely have many campaign names and dates and you probably don’t want to have to create a rule for every possible name and date combination.
- Using Contains may lead to results you don’t expect. In the example above, if you use a rule that says “If Contains ‘May’ then set Campaign Date to ‘May’”, you’ll end up misclassifying any tracking code whose campaign name happens to contain “May” (MaybellineSale, for instance).
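To make the Contains pitfall concrete, here is a minimal Python sketch. It is not how the rule builder is implemented; `contains_rule` is a hypothetical helper that imitates a Contains rule with a plain substring test, and both tracking codes are illustrative values in the channel:name:date form described above.

```python
# Hypothetical tracking codes in the form channel:campaign name:campaign date.
codes = ["em:MaybellineSale:June", "em:SpringSale:May"]


def contains_rule(code, needle="May", value="May"):
    """Imitate a Contains rule: any substring hit sets the classification."""
    return value if needle in code else None


# Both codes end up with Campaign Date = "May", even though the first one
# is a June campaign whose *name* merely contains the letters "May".
campaign_dates = {code: contains_rule(code) for code in codes}
```

The June campaign is tagged as May because a substring test has no idea which section of the tracking code it is looking at.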
This is where regular expressions come to the rescue.
What are regular expressions?
Wikipedia defines regular expressions this way:
A regular expression is a specific pattern that provides concise and flexible means to “match” (specify and recognize) strings of text, such as particular characters, words, or patterns of characters.
Some of you are probably familiar with the concept of “wildcards” that are used in string searches. In Windows, for instance, you can use a question mark (?) to represent any single character and an asterisk (*) to represent any string of characters:
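If you want to experiment with wildcards outside of a file search box, Python's standard `fnmatch` module follows the same `?` and `*` conventions. This is only a rough stand-in for what Windows does, and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# "?" stands for exactly one character.
one_char = fnmatchcase("report1.txt", "report?.txt")
too_long = fnmatchcase("report10.txt", "report?.txt")

# "*" stands for any run of characters, including none at all.
any_run = fnmatchcase("report10.txt", "report*.txt")
empty_run = fnmatchcase("report.txt", "report*.txt")
```

Here `one_char`, `any_run`, and `empty_run` are true, while `too_long` is false because `?` cannot stretch to cover two characters.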
Think of regular expressions as wildcard searches on steroids. Regular expressions are so powerful that it can take a while to learn how to fully leverage their capabilities, but learning the basics is pretty easy. For example, here are a few of the commonly used search parameters in regular expressions:
|Pattern|Matches|
|\s|Any whitespace character|
|.|Any single character|
|\S|Any non-whitespace character|
|\w|Any word character (letter, number, underscore)|
|\W|Any non-word character|
|\b|Any word boundary|
|(...)|Capture everything enclosed as a parameter|
|(a|b)|a or b|
|a?|Zero or one of a|
|a*|Zero or more of a|
|a+|One or more of a|
|a{3}|Exactly 3 of a|
|a{3,}|3 or more of a|
|a{3,6}|Between 3 and 6 of a|
|[abc]|A single character of: a, b or c|
|[^abc]|Any single character except: a, b, or c|
|[a-z]|Any single character in the range a-z|
|[a-zA-Z]|Any single character in the range a-z or A-Z|
|^|Start of line|
|$|End of line|
|\A|Start of string|
|\z|End of string|
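A quick way to get a feel for these building blocks is to try them in a scripting language. The sketch below uses Python's `re` module, on the assumption that it treats these particular patterns the same way the rule builder does; the sample strings, including the `ps:` prefix, are made up for illustration.

```python
import re

code = "em:MaybellineSale:June"

# [^:]+  one or more characters that are not a colon
non_colon_runs = re.findall(r"[^:]+", code)

# \w+  one or more word characters (letters, digits, underscore)
word_runs = re.findall(r"\w+", "em:Sale_2013 June")

# \b  word boundary: "May" inside "MaybellineSale" is not a whole word
may_inside_name = re.search(r"\bMay\b", code)
may_as_word = re.search(r"\bMay\b", "em:SpringSale:May")

# a{3,6}  between three and six of "a"
three_to_six = [s for s in ["aa", "aaa", "aaaaaa", "aaaaaaa"]
                if re.fullmatch(r"a{3,6}", s)]

# (em|ps)  either alternative, captured as the first parameter
channel_prefix = re.match(r"(em|ps):", "ps:Brand:July").group(1)
```

In this sketch `non_colon_runs` splits the tracking code into its three sections, `may_inside_name` is `None`, and `may_as_word` is a match, which hints at how a regular expression can avoid the Contains problem described earlier.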
In the classifications rule builder you can use regular expressions to match a wide variety of text, characters, words and patterns and use them to set classification columns.
Continuing the tracking code example I started with earlier in this post, let’s suppose I set up three rules that look like this:
If Starts With ‘em:’ then set Channel = Email
If matches regex ^([^:]+):([^:]+):([^:]+)$ then set Campaign Name = $2
If matches regex ^([^:]+):([^:]+):([^:]+)$ then set Campaign Date = $3
Holy cow, what does all that mean? Let’s take a closer look:
First, notice I used the same regular expression in both the second and third rules but set a different classification in each case. This regular expression matches any string (tracking code in this case) that starts with one or more non-colon characters, then a colon, then more non-colon characters, then a colon, then more non-colon characters. Note that the regular expression could easily be modified for use with any delimiter.
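As a rough illustration of swapping the delimiter, the Python sketch below builds the same three-section pattern for any single-character separator. `three_part_pattern` is a hypothetical helper written for this post, not something the rule builder provides; in the rule builder you would simply type the adjusted expression by hand.

```python
import re


def three_part_pattern(delimiter):
    """Build ^([^d]+)d([^d]+)d([^d]+)$ for a single-character delimiter d."""
    if len(delimiter) != 1:
        raise ValueError("this sketch only handles one-character delimiters")
    d = re.escape(delimiter)      # make characters like | or ^ literal
    section = f"([^{d}]+)"        # one or more non-delimiter characters
    return re.compile(f"^{section}{d}{section}{d}{section}$")


colon_parts = three_part_pattern(":").match("em:MaybellineSale:June").groups()
pipe_parts = three_part_pattern("|").match("em|MaybellineSale|June").groups()
too_few_sections = three_part_pattern(":").match("em:MaybellineSale")
```

Both `colon_parts` and `pipe_parts` come out as the same three sections, and `too_few_sections` is `None` because the pattern insists on exactly three sections between the start and end of the string.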
Second, I used parentheses to define portions of the string as parameters that I can use in my classification rules. Since I have three sets of parentheses I have defined three parameters: $1, $2, and $3. Based on my rules, if my tracking code is:
em:MaybellineSale:June
then the second and third rules will set:
Campaign Name = MaybellineSale
Campaign Date = June
What’s more, the regular expression I’ve chosen will work regardless of what substrings occur in the second and third sections of the tracking code. Sweet! Two rules to rule them all.
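If you would like to check this logic outside the rule builder, the three rules can be approximated in a few lines of Python. This is a sketch under assumptions: `classify` is a hypothetical function, Python's `re` module stands in for the rule builder's regex engine, and `m.group(2)` and `m.group(3)` play the roles of $2 and $3.

```python
import re

# Same expression as in the second and third rules.
TRACKING_CODE_RE = re.compile(r"^([^:]+):([^:]+):([^:]+)$")


def classify(tracking_code):
    """Apply the three example rules and return the classifications they set."""
    result = {}
    # Rule 1: Starts With 'em:' sets the channel.
    if tracking_code.startswith("em:"):
        result["Channel"] = "Email"
    # Rules 2 and 3: the second and third captured sections.
    m = TRACKING_CODE_RE.match(tracking_code)
    if m:
        result["Campaign Name"] = m.group(2)
        result["Campaign Date"] = m.group(3)
    return result


june = classify("em:MaybellineSale:June")
may = classify("em:SpringSale:May")
```

The Maybelline campaign is dated June here, and the hypothetical SpringSale code is dated May, because the date is read from the third section rather than from wherever the letters "May" happen to appear.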
Amazing. Where can I learn more?
The online documentation for the classifications rule builder contains a short primer on regular expressions and several examples of use cases that came up during our beta testing. For a full treatise on regular expressions I recommend this site: http://www.regular-expressions.info/
I hope you’ve found this series of blog posts helpful. Feel free to leave comments and questions below. I’m also interested in hearing from you if you come up with a really great regular expression for the rule builder that you’d like to share with the rest of the world.