Introduction to Regular Expressions – a Better Way to Filter Google Analytics Data

Because it is easy to use and, above all, free, Google Analytics has in some circles become a synonym for web analytics in general. Despite being free, it is a very powerful tool that can help you better understand how users interact with your website and, in turn, improve your product or service.

This article will be useful if you already have some experience with the GA interface and reports, even if you have only worked with it at a basic level and would now like to try some of the more complex options the tool offers.

What are regular expressions?

You have probably already tried to filter something or create a goal and seen the RegEx option among the match types.

Among the filtering options, besides Exact, there is also RegEx. RegEx is short for regular expression. Regular expressions may seem terrifying at first, but with a little bit of exercise and practice you’ll pick them up quickly.

Some of the reasons why you should use regular expressions in Google Analytics include the possibility of setting complex goals, complex funnels, excluding a certain range of IP addresses from reports as well as filtering the data based on complex patterns in GA reports.

Regular expressions help us group the strings based on a certain pattern.

Before we dive in and talk about the characters themselves and what they represent, it is important to understand one fact about regular expressions. By default they match as many things as the defined rules allow. This means that, in practice, you will often be writing an expression to exclude something rather than to include it. To make this clearer, take a look at the following example.

Let’s say someone mentioned your website on their blog, which is hosted on a free subdomain of the Blogger service (blogspot.com), and they linked to your website. Depending on the country where the blog is read, the domain will be different: instead of .com there will be that country’s ccTLD, for instance .rs, .dk, .de, etc. Hence, when users visit your website through this link, the Source in GA will differ depending on the user’s country.

In order to group all these visits we don’t need to list all the possible ccTLDs or write a regular expression such as blogspot\.(rs|dk|de); we can simply put blogspot. However, this also means that our report would include all the visits that might come from the search engine www.searchblogspot.com. Note that (^search)blogspot would not exclude them: a caret inside parentheses anchors the start of the string, it does not mean “not”. Requiring a literal dot in front of the name, as in \.blogspot\., is one way to leave them out.
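To make the difference concrete, here is a small sketch in Python. The Source values are made up for illustration, and Python's `re` module is only a stand-in for trying patterns out; GA's own regex engine may differ in details.

```python
import re

# Hypothetical Source values as they might appear in a GA report.
sources = [
    "myfriend.blogspot.rs",
    "myfriend.blogspot.dk",
    "myfriend.blogspot.com",
    "www.searchblogspot.com",
]

loose = re.compile(r"blogspot")       # matches anywhere in the string
strict = re.compile(r"\.blogspot\.")  # requires a literal dot on both sides

loose_hits = [s for s in sources if loose.search(s)]
strict_hits = [s for s in sources if strict.search(s)]

print(loose_hits)   # every source, including the search engine
print(strict_hits)  # only the three blog subdomains
```

The stricter pattern is just one possible choice; it works here because every blog lives on a subdomain, so a dot always precedes blogspot.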

If this looks unfamiliar and complicated, the next section explains exactly what these examples mean.

How to use regular expressions?

“ [ ] ” — square brackets let you list characters, one of which should be found in the string

Let’s look at the analytics report of an eCommerce website that has several jackets for sale, where the only difference in the URLs is the number at the end, for instance:

https://examplestore.com/men/jackets/jacket-1
https://examplestore.com/men/jackets/jacket-2

Let’s say there are 6 of those products (up to /jacket-6). We want to do an A/B test, and on the first 3 products the “Add to Cart” button is in a different spot. One way to separate the first 3 products from the second 3 with regular expressions would look like this:

/jacket-[123]/
/jacket-[456]/

We will use this grouping to compare the two groups of pages and see which position of the “Add to Cart” button is the better one, i.e., which led to a better conversion rate.
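The split can be tried out with a short Python sketch. The page paths below are assumptions: they end with a trailing slash so that they line up with the patterns above.

```python
import re

# Hypothetical page paths for the six jackets, with a trailing slash.
pages = [f"/men/jackets/jacket-{n}/" for n in range(1, 7)]

# Each character class accepts exactly one of the listed digits.
group_a = [p for p in pages if re.search(r"/jacket-[123]/", p)]
group_b = [p for p in pages if re.search(r"/jacket-[456]/", p)]

print(group_a)
print(group_b)
```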

“ - ” — the dash is fairly intuitive: it creates a range. For instance, to match any one of the digits from 0 to 9 we would write [0-9]. Going back to our previous example, the same groups of jackets can be selected like this:

/jacket-[1-3]/
/jacket-[4-6]/

Of course, in this particular case we save neither time nor characters, but this way you can cover a much bigger range:

[0-9] — matches one of the digits from 0 to 9
[a-z] — matches one of the lowercase letters from a to z
[A-Z] — matches one of the uppercase letters from A to Z
[a-zA-Z] — matches one lowercase or uppercase letter of the English alphabet
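A quick way to check what each range accepts is to run a few single characters through it. The sketch below uses Python's `re` module as a stand-in, and the sample characters are arbitrary. It also shows that, at least in Python, letters of both cases have to be written as two ranges, because a range like [a-Z] is rejected.

```python
import re

samples = ["7", "q", "Q", "_"]
classes = {
    "digit": r"[0-9]",
    "lower": r"[a-z]",
    "upper": r"[A-Z]",
    "letter": r"[a-zA-Z]",
}

# For each class, keep the samples that it matches as a whole string.
matches = {
    name: [s for s in samples if re.fullmatch(pattern, s)]
    for name, pattern in classes.items()
}
print(matches)

# A range must run from a lower to a higher character code, so Python
# rejects "[a-Z]" ("a" sorts after "Z").
try:
    re.compile(r"[a-Z]")
    mixed_range_ok = True
except re.error:
    mixed_range_ok = False
print(mixed_range_ok)
```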

“ . ” — the dot stands for one occurrence of any character, so .uk would match kuk, luk, puk, etc. If we would like to exclude IP addresses whose last number is a single digit from 0 to 9, it could look like this:

192.168.0..

Of course, the dot means that this expression would also match 192.168.0.u, but such an address cannot occur, so the expression is safe to use here.
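Here is a minimal Python sketch of that behaviour. The candidate strings are made up, and `re.fullmatch` is used on the assumption that the whole value should match, which GA does not require by default.

```python
import re

pattern = r"192.168.0.."  # the expression from the text, dots left unescaped

candidates = ["192.168.0.5", "192.168.0.u", "192.168.0.25"]

# The final dot stands for exactly one character of any kind.
results = {c: bool(re.fullmatch(pattern, c)) for c in candidates}
print(results)
```

The address ending in .25 is not matched in full because the last dot only covers a single character.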

“ * ” — the star is often used in the wrong context, because people usually think of it as a wildcard that matches “everything that comes afterwards”, but that is not the case. Its role is to match any number of occurrences of the previous character, including zero. Therefore the regular expression co*l would match col and cool, but also cl.
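A short Python sketch of the star, using a few made-up words. The last word is there to show that the star repeats the previous character only, not any character.

```python
import re

words = ["cl", "col", "cool", "coool", "cal"]

# "o*" means zero or more occurrences of the letter o.
star_hits = [w for w in words if re.fullmatch(r"co*l", w)]
print(star_hits)
```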

“ .* ” — dot and star together are not a special character in their own right, but they are used in combination so often that they are worth mentioning. Imagine we have access to YouTube’s analytics reports and would like to see how many people visited live channels. The URL for a live channel looks like this:

https://www.youtube.com/c/channel-name/live

If we would like all URLs that end with /live, but do not want www.youtube.com/live, the regular expression would look like this:

/c/.*/live

Let’s say we are working with an online store. On one hand we have destination goals: after making a purchase, a customer is redirected to /jackets/thanks-for-shopping for jackets, to /t-shirts/thanks-for-shopping for t-shirts, and likewise for other categories. On the other hand we would like to track all such transactions as one goal as well. The regular expression for this would look like this:

/.*/thanks-for-shopping
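Both dot-star examples can be checked with a few sample paths. The paths below are assumptions for illustration, not real report data.

```python
import re

thanks = re.compile(r"/.*/thanks-for-shopping")
live = re.compile(r"/c/.*/live")

# Hypothetical page paths.
shop_paths = [
    "/jackets/thanks-for-shopping",
    "/t-shirts/thanks-for-shopping",
    "/thanks-for-shopping",
    "/jackets/cart",
]
video_paths = ["/c/channel-name/live", "/live"]

# ".*" fills in whatever sits between the two slashes.
thanks_hits = [p for p in shop_paths if thanks.search(p)]
live_hits = [p for p in video_paths if live.search(p)]

print(thanks_hits)
print(live_hits)
```

A thank-you page that sits directly under the root is not matched, because the pattern asks for two slashes.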

“ ? ” — the question mark says that the preceding character is optional. An example from English is grouping words regardless of whether they are spelled the British or the American way, such as colour, behaviour, and others whose American version drops the letter “u”. The regular expressions for these would be colou?r, behaviou?r or favou?rite, each of which matches both spellings.

“ + ” — the plus sign has a role similar to the question mark. However, while ? matches strings whether or not the previous character is present, + matches one or more occurrences of the previous character, i.e., there must be at least one. Therefore, if we take our previous example and substitute ? with +, giving colou+r, we would match colour and colouur, but not color.
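The question mark and the plus sign are easiest to compare side by side. Here is a small Python sketch on three spellings.

```python
import re

words = ["color", "colour", "colouur"]

# "u?" allows zero or one u, "u+" requires one or more.
optional_u = [w for w in words if re.fullmatch(r"colou?r", w)]
at_least_one_u = [w for w in words if re.fullmatch(r"colou+r", w)]

print(optional_u)
print(at_least_one_u)
```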

“ { } ” — curly brackets match a certain number of occurrences of the previous character. The RegEx “a{2}” means two occurrences of the letter “a”. We can also give a range, such as a{2,5}, which means that the previous character occurs at least twice but no more than 5 times. Therefore the expression co{2,5}l would match the words cool, coool, cooool and coooool, but neither col nor cooooool, because “o” appears only once and six times, respectively.
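The range can be verified by generating the words with one to six letters “o” and counting which ones pass. This is a sketch in Python.

```python
import re

# Build "col", "cool", ... up to the word with six o's.
words = ["c" + "o" * n + "l" for n in range(1, 7)]

# {2,5} accepts between two and five repetitions of the previous character.
in_range = [w for w in words if re.fullmatch(r"co{2,5}l", w)]
out_of_range = [w for w in words if w not in in_range]

print(in_range)
print(out_of_range)
```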

“ ( ) ” — parentheses work just like in math: they group elements. Previously I said that with square brackets we can check for the occurrence of one of the listed characters; parentheses allow us to check for the occurrence of a string. If we go back to the online store example, where we have separate categories for male and female clothes, it would look like this:

/(|fe)male/
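In this pattern the group offers two alternatives, an empty string or “fe”, in front of “male”. Here is a quick Python check with made-up category paths.

```python
import re

pattern = re.compile(r"/(|fe)male/")

# Hypothetical category paths.
paths = ["/male/shirts", "/female/shirts", "/kids/shirts"]

# The group matches either nothing or the string "fe".
gender_hits = [p for p in paths if pattern.search(p)]
print(gender_hits)
```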

“ ^ ” — the caret means that whatever we would like to match starts with this. For instance, ^pre would match the following words: prequel, preview, prewash, etc. The caret can also mean “not” when it is the first character inside square brackets. Therefore category-[^b]/ would match:

examplestore.com/category-a/, but not examplestore.com/category-b/
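The two roles of the caret can be sketched in Python as follows. The word “express” is an added example that contains “pre” but does not start with it. The negated class is placed right after “category-”, which is an assumption about where the letter sits in the URL.

```python
import re

# Role 1: anchor the match to the start of the string.
words = ["prequel", "preview", "express"]
pre_hits = [w for w in words if re.search(r"^pre", w)]

# Role 2: inside square brackets, negate the listed characters.
urls = ["examplestore.com/category-a/", "examplestore.com/category-b/"]
not_b_hits = [u for u in urls if re.search(r"category-[^b]/", u)]

print(pre_hits)
print(not_b_hits)
```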

“ $ ” — the dollar sign means that the string ends with this. For instance, tion$ would match all the words that end with -tion, such as destination, promotion, traction, etc. It would be intuitive to think that to see only the homepage in our report we should use “/$”; however, this would match every page without dynamic parameters, because by default they all end with /.

The correct RegEx, which matches only the homepage, uses the caret as well and looks like this: ^/$. It essentially says that the page path must start and end with the same single / character.
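The difference between the two expressions shows up as soon as they are run against a handful of page paths. The paths in this Python sketch are hypothetical.

```python
import re

paths = ["/", "/men/", "/men/jackets/", "/search?q=jacket"]

# "/$" only asks for a slash at the very end.
ends_with_slash = [p for p in paths if re.search(r"/$", p)]

# "^/$" asks for a string that is nothing but a single slash.
homepage_only = [p for p in paths if re.search(r"^/$", p)]

print(ends_with_slash)
print(homepage_only)
```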

“ | ” — the pipe, as in many other contexts, represents “or”. When you simply cannot write a single expression that matches all the pages you need, you can use the pipe. For instance, with the jacket URLs mentioned above, the regular expressions could look like this:

/jacket-(1|2|3)/$
/jacket-(4|5|6)/$
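Here is a short Python sketch of these two expressions. The page paths are assumptions, and one extra path with a “reviews” suffix is added to show what the closing $ rules out.

```python
import re

# Hypothetical page paths, plus one sub-page that should not be counted.
pages = [f"/men/jackets/jacket-{n}/" for n in range(1, 7)]
pages.append("/men/jackets/jacket-1/reviews")

first_half = [p for p in pages if re.search(r"/jacket-(1|2|3)/$", p)]
second_half = [p for p in pages if re.search(r"/jacket-(4|5|6)/$", p)]

print(first_half)
print(second_half)
```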

“ \ ” — the backslash neutralizes the special meaning of the character that follows it. One example is excluding your own IP address from Google Analytics in order to remove noise from the data. Let’s say your IP address is 192.168.1.254. Since the dot has its own function as a RegEx character, each dot has to be escaped, and the expression for filtering this IP would look like this: 192\.168\.1\.254
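The effect of escaping can be seen by comparing the two versions in Python. The second candidate string is deliberately nonsensical and is there only to show what an unescaped dot lets through. `re.escape` is a Python helper, not something GA offers.

```python
import re

my_ip = "192.168.1.254"
unescaped = r"192.168.1.254"     # every dot matches any character
escaped = r"192\.168\.1\.254"    # every dot matches a literal dot only

candidates = ["192.168.1.254", "192x168y1z254"]

unescaped_hits = [c for c in candidates if re.fullmatch(unescaped, c)]
escaped_hits = [c for c in candidates if re.fullmatch(escaped, c)]

# Python can add the backslashes for you.
auto_escaped = re.escape(my_ip)

print(unescaped_hits)
print(escaped_hits)
print(auto_escaped)
```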

Other characters that might be helpful:

\s — whitespace
\S — non-whitespace
\d — digit
\D — non-digit
\w — any alphanumeric character or _
\W — any character that is not alphanumeric or _
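A few of these shorthand classes applied to a made-up string, as a Python sketch. Note that Python treats these classes as Unicode-aware by default, which may not be true of every regex engine.

```python
import re

text = "SKU_42 ships Monday"

digits = re.findall(r"\d", text)        # single digits
words = re.findall(r"\w+", text)        # runs of letters, digits and _
spaces = re.findall(r"\s", text)        # whitespace characters
non_word = re.findall(r"\W", text)      # anything that is not \w

print(digits)
print(words)
print(spaces)
print(non_word)
```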

SPAM filtering

One of the more advanced examples of using regular expressions in Google Analytics is filtering out the SPAM that hit us in the previous year, mostly in its final months. The difference between SPAM and other visits was most visible in the “Language” dimension, so that is where we filter.

There were a lot of solutions to this problem on the internet, but the majority of them were, for some reason, unnecessarily complicated. The regular expression from this image matches and excludes all the language values that are more than 15 characters long. None of the language codes has more than 15 characters and, for example, none of the SPAM values that were coming to startit.rs was shorter than that.
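The exact expression from the image is not reproduced here, so the pattern in this sketch is an assumption: `.{16,}` is one way to write “more than 15 characters of any kind”. The language values, including the spam-like one, are invented for illustration.

```python
import re

# Assumed pattern: any value that is at least 16 characters long.
too_long = re.compile(r".{16,}")

# Hypothetical values of the Language dimension.
languages = [
    "en-us",
    "sr-rs",
    "pt-br",
    "visit example-spam.invalid for offers",
]

# Keep only the values the exclusion pattern does not match.
kept = [lang for lang in languages if not too_long.search(lang)]
print(kept)
```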

In order to make learning RegEx easier, you can use Regex Pal where you can check what your expression matches and once you become more advanced feel free to visit RexEgg, the coolest Regular Expression website on Earth.

Also, I would suggest that you keep all the characters in a table with a short explanation for each, as a reminder and maybe also as a reference when brainstorming more complex patterns.

I already made a table for you and, if you’d like, you can download it here.

I originally wrote this article in Serbian for the organization I worked for, Startit, and it can be found here.