July 2, 2014 magda piatkowska

Quick start regex for analysts: Part II


In my previous post (Part I) I went over the basic metacharacters and special signs in regex. In this second part I will be showing you how to simplify regular expressions.

Let’s get started

Repetition metacharacters

So far, we have looked at expressions in which each part of the pattern matches a single character in the text. Repetition metacharacters make the expression much more flexible by letting one part of the pattern match a variable number of characters.

* (preceding item zero or more times)
+ (preceding item one or more times)
? (preceding item zero or one time)

For example:

apples*

will match the word with no “s” as well as one or many “s”. But in:

apples+

the ‘s’ must be there so it will match words with at least one “s”.

Whereas:

apples?

the ‘s’ is optional: it will match the word with no “s” or with exactly one “s”, but the “s” can’t be repeated.
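Here is a minimal sketch of the three patterns in Python, assuming the standard re module. The word list is made up for illustration, and I use re.fullmatch so that a word only counts when the whole of it fits the pattern.

```python
import re

# Hypothetical sample words, chosen only to show the difference
words = ["apple", "apples", "applesss"]

# fullmatch succeeds only if the entire word fits the pattern
star = [w for w in words if re.fullmatch(r"apples*", w)]
plus = [w for w in words if re.fullmatch(r"apples+", w)]
optional = [w for w in words if re.fullmatch(r"apples?", w)]

print(star)      # ['apple', 'apples', 'applesss']
print(plus)      # ['apples', 'applesss']
print(optional)  # ['apple', 'apples']
```

Note that with a search instead of a full match, apples? would still find "apples" inside "applesss"; it just would not consume the extra letters.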

Quantified repetition

This works similarly to the repetition metacharacters. The difference is that we can specify the exact number of repetitions of the preceding item.

{ - start of quantified repetition
} – end of the repetition

Syntax:

{min,max} (min and max are whole numbers, zero or greater)

Tip

min must always be there, even if it is 0. Max and the comma are optional.

In our previous post we had an example where we wanted to match only the year. We can now do the following:

\d{4}

This represents a digit repeated exactly 4 times and can be used instead of typing \d\d\d\d.

Similarly we find:

\w{5,10} - a word of minimum 5 and maximum 10 word characters
\w{5,} - a word of minimum 5 word characters
\w{5} - a word of exactly 5 word characters
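A quick way to check these quantifiers is to run them with Python's re module. The sample sentence below is invented, and the \b word boundaries are my own addition (they are not part of the patterns above); without them \d{4} would also pick up the first four digits of a longer number.

```python
import re

# Hypothetical sample sentence
text = "In 2014 the analytics team processed 12345 records."

# Without boundaries, the first four digits of 12345 also match
loose_years = re.findall(r"\d{4}", text)

# \b (word boundary) keeps only standalone 4-digit numbers
years = re.findall(r"\b\d{4}\b", text)

# \w also matches digits and underscores, so 12345 counts as a "word"
medium_words = re.findall(r"\b\w{5,10}\b", text)
five_chars = re.findall(r"\b\w{5}\b", text)

print(loose_years)   # ['2014', '1234']
print(years)         # ['2014']
print(medium_words)  # ['analytics', 'processed', '12345', 'records']
print(five_chars)    # ['12345']
```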

Let’s say we are trying to pick out IP addresses from the text. An IP address is a sequence of 4 sets of 1 – 3 digits separated by dots. It can be easily expressed by:

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

The .(dot) had to be escaped to be interpreted literally and each set of digits is repeated a minimum of one and a maximum of three times. Shortly we will learn how to simplify it even further.
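As a sketch, this is how the pattern could be used to pull addresses out of a line of text in Python. The log line is made up for illustration. Keep in mind that the pattern only checks the shape of the address, not that each number is between 0 and 255.

```python
import re

ip_pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# Hypothetical log line
log_line = "Failed login from 192.168.0.1, retried from 10.0.0.254 at 10:45."

ips = re.findall(ip_pattern, log_line)
print(ips)  # ['192.168.0.1', '10.0.0.254']

# The shape is right, so this is accepted even though 999 is not a valid octet
accepts_invalid = re.fullmatch(ip_pattern, "999.999.999.999") is not None
print(accepts_invalid)  # True
```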

Tip

There are usually many ways to write a regex expression matching your needs and no one perfect way! As long as it does the job and you are sure it matches exactly what you want don’t worry too much about what it looks like!

Grouping metacharacters

Using () around a group of characters enables repetition of that whole group.

Tip

Don’t group in the character sets (within []) as () will have the literal meaning.

(what)+

will match “what” one or more times.

And to match words with or without the prefix we use:

(in)?dependent
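To see what the brackets change, here is a small Python sketch comparing the grouped and ungrouped versions. The test strings are my own examples.

```python
import re

# With the group, the whole word "what" is repeated
grouped_repeat = re.fullmatch(r"(what)+", "whatwhat") is not None

# Without the group, + applies only to the final "t"
ungrouped_repeat = re.fullmatch(r"what+", "whatwhat") is not None
ungrouped_t = re.fullmatch(r"what+", "whattt") is not None

# Optional prefix: "in" may appear once or not at all
candidates = ["dependent", "independent", "undependent"]
with_prefix = [w for w in candidates if re.fullmatch(r"(in)?dependent", w)]

print(grouped_repeat)    # True
print(ungrouped_repeat)  # False
print(ungrouped_t)       # True
print(with_prefix)       # ['dependent', 'independent']
```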

Coming back to the IP address example, we can group each set of 1-3 digits with an optional dot and then repeat the group 4 times.

(\d{1,3}\.?){4}

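This will fully match any IP address. Because the dot is optional, it will also match a few strings that are not IP addresses, such as 1234, so check the results if your text contains other numbers.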
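The sketch below tries the grouped pattern on a few strings in Python. The candidate strings are invented, and the stricter pattern at the end is my own suggestion rather than part of the original example: it makes the dot compulsory after the first three sets of digits.

```python
import re

grouped = r"(\d{1,3}\.?){4}"
# Assumed stricter variant: three "digits + dot" groups, then final digits
strict = r"(\d{1,3}\.){3}\d{1,3}"

# Hypothetical candidates: two real-looking addresses and two impostors
candidates = ["192.168.0.1", "10.0.0.254", "1234", "1.2.3.4."]

grouped_hits = [c for c in candidates if re.fullmatch(grouped, c)]
strict_hits = [c for c in candidates if re.fullmatch(strict, c)]

print(grouped_hits)  # ['192.168.0.1', '10.0.0.254', '1234', '1.2.3.4.']
print(strict_hits)   # ['192.168.0.1', '10.0.0.254']
```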

Alternation metacharacters

A common way of dealing with an incorrect spelling is by using the OR character and then grouping the two (or more) alternatives. This way we don’t need to repeat the [] sets.

| - (previous OR next expression)

Take for example the commonly misspelled word:

w(ei|ie)rd
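In Python this could be used both to find the two spellings and to tidy them up. The sentence is made up, and using re.sub to normalise the spelling is my own illustration of where alternation comes in handy.

```python
import re

pattern = r"w(ei|ie)rd"

# Hypothetical text containing both spellings
text = "It felt weird at first, but the wierd part came later."

# group(0) is the whole match; the brackets alone would only give "ei"/"ie"
found = [m.group(0) for m in re.finditer(pattern, text)]

# Replace every variant with the correct spelling
cleaned = re.sub(pattern, "weird", text)

print(found)    # ['weird', 'wierd']
print(cleaned)  # It felt weird at first, but the weird part came later.
```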

Anchors

Anchors signify the position of the pattern in the text. Note that this is a second meaning of the caret (^): inside a character set it is a negation.

^ (start of string/line)
$ (end of string or line)
\A (start of string, never start of line)
\Z (end of string, never end of line)

For example:

^apple

will match only ‘apple’ at the beginning of the line.
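Whether ^ and $ work per line or per string depends on the settings of your tool. The sketch below shows the behaviour in Python, where they apply to the whole string unless the re.MULTILINE flag is set. The three-line text is invented for the example.

```python
import re

# Hypothetical three-line text
text = "apple pie\npineapple\napple tart"

# Default: ^ matches only at the start of the whole string
start_default = re.findall(r"^apple", text)

# MULTILINE: ^ also matches at the start of every line
start_lines = re.findall(r"^apple", text, flags=re.MULTILINE)

# \A and \Z ignore the flag and stick to the string boundaries
start_string = re.findall(r"\Aapple", text, flags=re.MULTILINE)
end_string = re.findall(r"tart\Z", text, flags=re.MULTILINE)

# $ finds "pie" at the end of line 1 only when MULTILINE is on
end_default = re.findall(r"pie$", text)
end_lines = re.findall(r"pie$", text, flags=re.MULTILINE)

print(start_default)  # ['apple']
print(start_lines)    # ['apple', 'apple']
print(start_string)   # ['apple']
print(end_string)     # ['tart']
print(end_default)    # []
print(end_lines)      # ['pie']
```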

Lookaround assertions

Bear in mind that these expressions differ significantly in the different variants of regex.

?= (Assertion of what ought to be ahead)
?! (negative lookahead)
?<= (positive look behind assertion, what ought to be behind)
?<! (negative look behind)

For example:

(?=seashore)sea

Will match “sea” only if it is followed by “shore”.

Tip

In most flavors, look behind can’t be used with repetitions or optional expressions, because the pattern inside it must have a fixed length. At the time of writing it also doesn’t work in JavaScript (hence can’t be tested in regexpal).

Tip

Lookaround assertions tend not to work very well in text editors.
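Python's re module supports all four assertions, so it is a convenient place to try them out. In this sketch the sample text and the starts helper are my own, and I report the position of each match so you can tell which "sea" was found. The last part shows Python rejecting a look behind that contains an optional character.

```python
import re

# Hypothetical text: "sea" starts at positions 0, 9 and 22
text = "seashore seaside undersea"

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

ahead = starts(r"sea(?=shore)")         # followed by "shore"
ahead_doc = starts(r"(?=seashore)sea")  # same idea, assertion written first
not_ahead = starts(r"sea(?!shore)")     # not followed by "shore"
behind = starts(r"(?<=under)sea")       # preceded by "under"
not_behind = starts(r"(?<!under)sea")   # not preceded by "under"

print(ahead, ahead_doc)  # [0] [0]
print(not_ahead)         # [9, 22]
print(behind)            # [22]
print(not_behind)        # [0, 9]

# An optional "r" makes the look behind variable-length, which re refuses
try:
    re.compile(r"(?<=under?)sea")
    lookbehind_rejected = False
except re.error:
    lookbehind_rejected = True

print(lookbehind_rejected)  # True
```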

Differences between programming languages

Here is a quick summary of the major differences between regex in different programming languages. It is not exhaustive, but will give you the idea of the scope of differences. Again, it is always good to let Google know what language you are working in while searching for regex solutions.

| Regex | Ruby | Java | Perl | Python/R | Unix | JavaScript | PHP | .NET |
|---|---|---|---|---|---|---|---|---|
| Character classes (e.g. \d, \w) | Yes | No | Yes | Yes | No | Yes | Yes | Yes |
| POSIX bracket expressions | Yes | No | Yes | No | Yes | No | Yes | No |
| Quantifiers: * | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Quantifiers: + and ? | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes |
| Anchors: \A and \Z | Yes | Yes | Yes | Yes | No | No | Yes | Yes |
| Line break: /m | Yes | No | Yes | No | Yes | Yes | Yes | No |
| Special command for line break | No | Yes | No | Yes | No | No | No | Yes |
| Lookaround assertions | only 1.9 and above | Yes | Yes | Yes | No | No | Yes | Yes |

My next post is all about using regex in a real life example!


About the Author

Magda Piatkowska: I left uni as a systems engineer to take up DBA and later various BI positions at eircom in Dublin, Ireland. I then moved from telco to the gaming industry to join Silicon Valley's Zynga, where I built an international insights and analytics team. The team specialised in real-time insights delivery and developing machine learning capabilities. We focused on text mining algorithms in order to include customer feedback in product development, segmentation and recommendations. I am currently with Channel4, where we are building a cutting-edge data science team. I also strongly support girls in rocking the world of technology!
