July 17, 2014 magda piatkowska

Quick start regex for analysts: Part III Cleaning up a messy dataset

Tweet about this on TwitterShare on LinkedInShare on FacebookGoogle+Share on StumbleUponEmail to someone

In my third post on regex skills for analysts we’ll look at how we put the skills we learnt in part I and part II to use with a simple exercise.

We are often faced with a messy datasets in a text format. The easiest way to manipulate them is using a text editor (if they fit). I use TextWrangler, but it can be any editor.

We have a sample data set in a text file. It is formated as follows (I am using a sample from IMDB database):

“‘Allo ‘Allo!” (1982) {Firing Squashed (#8.5)} UK
“‘Allo ‘Allo!” (1982) {Fleeing Monks (#7.3)} UK
“‘Allo ‘Allo!” (1982) {Flight of Fancy (#3.4)} UK
“‘Allo ‘Allo!” (1982) {Forged Francs & Fishsellers (#5.15)} UK

Instead of all the brackets, tabs and adverted commas, we would like to have four neat comma separated columns: Title, Episode Title, Production Year and Country. Here are the steps to achieve that:

Step 1

Open the file in TextWrangler. Go to the Find and Replace window. There should be an option that you need to tick (“regular expressions” or “grep”). This will enable regex matching. You can use now the Replace and Replace with fields to type a regular expression.

Step 2

The first step to achieve the required change is to remove the numbers in brackets from the episode title (click on the code to test the match in Regexpal):

\s\(#\d.\d{1,2}\)

This will find and replace it with an empty string.

Step 3

Replace the curly brackets, brackets, spaces and tabs between the fields with comma

("\s*\()|(\)\s*{)|(}\s*)

Step 4

Remove the ” at the beginning of the line

^”

This will find and replace it with an empty string.

Step 5

Match the columns “Production year” and “Episode title”. Divide them into two sets, so we can reorder them

(\d{4},)(.*,)

Step 6

Replace with reordered sets. Second comes first before set 1.

\2\1

Done! You can now do regex.

Tagged: ,

About the Author

magda piatkowska I left uni as a systems engineer to take up DBA and later various BI positions at eircom in Dublin, Ireland. I then moved from telco to the gaming industry to join Silicon Valley's Zynga. I built there an international insights and analytics team. The team specialised in real time insights delivery and developing machine learning capabilities. We focused on text mining algorithms in order to include customer feedback in product development, segmentation and recommendations. I am currently with Channel4 where we are building a cutting edge data science team. Also, strongly supporting girls in rocking the world of technology!

Comments (2)

Leave a Reply

Your email address will not be published. Required fields are marked *

Machine Learning and Analytics based in London, UK