Quick start regex for analysts: Part III Cleaning up a messy dataset

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

In my third post on regex skills for analysts we’ll look at how we put the skills we learnt in part I and part II to use with a simple exercise.

We are often faced with a messy datasets in a text format. The easiest way to manipulate them is using a text editor (if they fit). I use TextWrangler, but it can be any editor.

We have a sample data set in a text file. It is formated as follows (I am using a sample from IMDB database):

“‘Allo ‘Allo!” (1982) {Firing Squashed (#8.5)} UK
“‘Allo ‘Allo!” (1982) {Fleeing Monks (#7.3)} UK
“‘Allo ‘Allo!” (1982) {Flight of Fancy (#3.4)} UK
“‘Allo ‘Allo!” (1982) {Forged Francs & Fishsellers (#5.15)} UK

Instead of all the brackets, tabs and adverted commas, we would like to have four neat comma separated columns: Title, Episode Title, Production Year and Country. Here are the steps to achieve that:

Step 1

Open the file in TextWrangler. Go to the Find and Replace window. There should be an option that you need to tick (“regular expressions” or “grep”). This will enable regex matching. You can use now the Replace and Replace with fields to type a regular expression.

Step 2

The first step to achieve the required change is to remove the numbers in brackets from the episode title (click on the code to test the match in Regexpal):

\s\(#\d.\d{1,2}\)

This will find and replace it with an empty string.

Step 3

Replace the curly brackets, brackets, spaces and tabs between the fields with comma

("\s*\()|(\)\s*{)|(}\s*)

Step 4

Remove the ” at the beginning of the line

^”

This will find and replace it with an empty string.

Step 5

Match the columns “Production year” and “Episode title”. Divide them into two sets, so we can reorder them

(\d{4},)(.*,)

Step 6

Replace with reordered sets. Second comes first before set 1.

\2\1

Done! You can now do regex.

Quick start regex for analysts: Part II

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

In my previous post (Part I) I went over the basic metacharacters and special signs in regex. In this second part I will be showing you how to simplify regular expressions.

Let’s get started

Repetition metacharacters

So far, we looked at expressions that always match the pattern to a single position in the text. Repetition metacharacters make the expression much more flexible by expanding the pattern to a specified number of characters.

‘* (proceeding item zero or more times)
+ (proceeding item one or more times)
? (proceeding item zero or one time)

For example (click on the code to see how it works):

apples*

will match the word with no “s” as well as one or many “s”. But in:

apples+

the ‘s’ must be there so it will match words with at least one “s”.

Whereas:

apples?

The ‘s’ doesn’t have to be at the end of the string but it can’t be repeated.

Quantified repetition

This works similarly to the repetition metacharacters. The difference is that we can specify exact number of repetitions of the sign.

{ - start of quantified repetition
} – end of the repetition

Syntax:

{min,max} (min and max are positive numbers)

Tip

min must always be there even if it is 0. Max and the coma are optional.

In our previous post we had an example where we wanted to match only the year. We can now do the following:

\d{4}

This represents a digit repeated exactly 4 times and can be used instead of typing \d\d\d\d.

Similarly we find:

\w{5,10} - minimum 5 letter and maximum 10 letter word
\w{5,} - minimum 5 letters word
\w{5} - exactly 5 letters word

Let’s say we are trying to pick out IP addresses from the text. An IP address is a sequence of 4 sets of 1 – 3 digits separated by dots. It can be easily expressed by:

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

The .(dot) had to be escaped to be interpreted literally and each set of digits is repeated a minimum of one and a maximum of three times. Shortly we will learn how to simplify it even further.

Tip

There are usually many ways to write a regex expression matching your needs and no one perfect way! As long as it does the job and you are sure it matches exactly what you want don’t worry too much about what it looks like!

Grouping metacharacters

Using () around the groups of characters will enable repetation of that group.

Tip

Don’t group in the character sets (within []) as () will have the literal meaning.

(what)+

will match “what” one or more times.

And to match words with or without the prefix we use:

(in)?dependent

Coming back to the IP address example. We can group each 1-3 digits and an optional dot and then repeat it 4 times.

(\d{1,3}\.?){4}

This will fully match any IP address.

Alternation metacharacters

A common way of dealing with an incorrect spelling is by using the OR character and then grouping the two (or more) alternatives. This way we don’t need to repeat the [] sets.

| - (previous OR next expression)

Take for example the commonly misspelled word:

w(ei|ie)rd

Anchors

Anchors signify the position of the pattern in the text. Note, that it is a second meaning of carat (it is also a negation, check it out here).

^ (start of string/line)
$ (end of string or line)
\A (start of string, never end of line)
\Z (end of string, never end of line)

For example:

^apple

will match only ‘apple’ at the beginning of the line.

Lookaround assertions

Bear in mind that these expressions differ significantly in the different variants of regex.

?= (Assertion of what ought to be ahead)
?! (negative lookahead)
?<= (positive look behind assertion, what ought to be behind)
?<!-- (negative look behind)

For example:

(?=seashore)sea

Will match “sea” only if it is followed by “shore”.

Tip

Look behind can’t be used with repetitions or optional expressions. It also doesn’t work in JavaScript (hence can’t be tested in regexpal).

Tip

It tends not to work very well in text editors.

Differences between programming languages

Here is a quick summary of the major differences between regex in different programming languages. It is not exhaustive, but will give you the idea of the scope of differences. Again, it is always good to let Google know what language you are working in while searching for regex solutions.

Regex Ruby Java Perl Python/R Unix JavaScript PHP .NET
Character Classes :(e.g. \d; \w) Yes No Yes Yes No Yes Yes Yes
POSIX bracket expressions Yes No Yes No Yes No Yes No
Quantifiers: * Yes Yes Yes Yes Yes Yes Yes Yes
Quantifiers: + and ? Yes Yes Yes Yes No Yes Yes Yes
Anchors: \A and \Z Yes Yes Yes Yes No No Yes Yes
Line break: /m Yes No Yes No Yes Yes Yes No
Special command for line break No Yes No Yes No No No Yes
Lookaround assertions only 1.9 and above Yes Yes Yes No No Yes Yes

My next post is all about using regex in a real life example!

Quick start regex for analysts: Part I

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

…with much excitement and a tiny bit of nervousness, here I am with my first blog post on Coppelia! Starting with my beloved regex, this is a series of three posts. Thank you so much Simon Raper for having me here and I hope you guys will enjoy reading it at least as I enjoyed writing it!

Let’s do it…!

Learning and applying regex (short for regular expressions) can be a frustrating process. I have been there and spent many painful hours figuring it out. I’d now like to share what I’ve learnt in the hope that it will be useful to others.

What you’ll need
  • A Java enabled browser
  • An understanding of datasets in text format and different types of separators
  • A text editor (I am using TextWrangler, but any is OK, more on that below)

Read more

Machine Learning and Analytics based in London, UK