June 24, 2014 magda piatkowska

Quick start regex for analysts: Part I


…with much excitement and a tiny bit of nervousness, here I am with my first blog post on Coppelia! Starting with my beloved regex, this is a series of three posts. Thank you so much Simon Raper for having me here, and I hope you guys will enjoy reading it at least as much as I enjoyed writing it!

Let’s do it…!

Learning and applying regex (short for regular expressions) can be a frustrating process. I have been there and spent many painful hours figuring it out. I’d now like to share what I’ve learnt in the hope that it will be useful to others.

What you’ll need
  • A JavaScript-enabled browser
  • An understanding of datasets in text format and different types of separators
  • A text editor (I am using TextWrangler, but any is OK, more on that below)


Why learn it?

You could use regex to convert a tab-delimited file into a comma-delimited file, to find duplicate words in a text, to recognize incorrect e-mail addresses or to remove specific text elements. The power is in the speed. Regular expressions are really fast and can be used across many programming languages and tools.
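To make the first of those uses concrete, here is a minimal sketch in Python. The sample row is made up for illustration, and the conversion is deliberately naive: it assumes no field contains a comma or a quoted tab.

```python
import re

row = "name\tcity\tage"  # one tab-delimited line, assumed for illustration

# Replace every tab character with a comma.
converted = re.sub(r"\t", ",", row)
print(converted)  # name,city,age
```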

What is regex?

A regular expression (regex) is a set of symbols that describes a text pattern. Regular expressions are the formal language of these symbols. For us to make use of them they need to be interpreted by a regular expression processor. The processor uses those symbols to allow us to match, search, and manipulate text.

Regular expressions are not a programming language, although they may seem similar because they are a formal language with a defined set of rules.

Most programming languages support regular expressions. However, a regular expression contains no variables and you can’t add 2+2 with it (you can later guess what would happen if you typed that).

If you are now really into the topic, here is a visual history of Regex.

What is the regex engine?

The engine is the piece of software that processes regular expressions: it matches the pattern described by the regex against the string that you want to process. The regex engine is usually built into programming languages or bigger applications.

Applications that have a built in regex engine:

Operating systems

grep, egrep and sed in Unix, Linux and Mac OS X

Text editors

Applications to test your regular expressions

  • iPhone apps
  • RegexPal (JavaScript online tool)

Programming languages with regex support

  • Java
  • JavaScript
  • .NET
  • Perl
  • PHP
  • Python
  • R
  • Ruby
  • XML schema

Let’s get started

Regex basics are the key. Once you get your head around the meaning of each of the basic characters and special signs in regex you will be able to write your own more complex expressions. One of the biggest problems that you will face with regex is that the expressions might differ from environment to environment (that is what stopped me for a long time from understanding regex). The differences include different ways of invoking regex, types of special characters and their meaning. In the second part of this tutorial you will find a table summarizing some (but not all!) differences.

Tip

When you are Googling something use the search phrase along with the programming language you are working in to get the most relevant advice. Like this.

To start go to Regexpal.com (or any other regex tester). You can also just click on any example and it will take you to the output. Play away!

Start and end of the regular expression

In many languages it is necessary to mark the start and the end of the regular expression.

/ / (a pair of forward slashes) marks the regular expression, e.g. /regex/

Tip

Different programming languages use different delimiters, e.g. R uses quotation marks (“”), and many JavaScript-based testers (like Regexpal) do not need the slashes at all. To match the expression in R you can use:

grep("c.t", c("cat", "cut"))

But in Regexpal, it is just:

c.t
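For comparison, here is a minimal sketch of the same match in Python, which (like R) takes the pattern as a string rather than between slashes. The word list and the choice of a raw string (the r prefix) are my own assumptions for the illustration.

```python
import re

words = ["cat", "cut", "coat"]  # sample words, assumed for illustration

# The pattern is an ordinary string; the r prefix keeps backslashes literal.
pattern = r"c.t"

# Keep the words that contain a 'c', then any one character, then a 't'.
matches = [w for w in words if re.search(pattern, w)]
print(matches)  # ['cat', 'cut']
```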

Literal characters

Here is your first regular expression (Click on the code to see the result. I will use that going forward in examples):

high

Taa daa! What you see is what you get. This regex will match exactly the text in the regular expression.

Note that in most languages the regex will, by default, match only the first (leftmost) occurrence. To apply the pattern to the whole string we must make it global by adding the g flag after the closing slash.

/high/g

There is no need to do this in most of the testers as the default setting is global.
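Python has no g flag; as a rough equivalent, you pick a function instead. The sketch below assumes a made-up sample sentence and contrasts re.search (first match only) with re.findall (all matches).

```python
import re

text = "The high street runs past the high school."  # assumed sample text

first = re.search(r"high", text)   # stops at the first (leftmost) match
every = re.findall(r"high", text)  # collects all non-overlapping matches

print(first.start())  # 4
print(every)          # ['high', 'high']
```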

Regex is case sensitive. That is why high in Highclere castle wasn’t matched. It is also space sensitive (as space is a character) so to match it a space would need to be added. Check this out:

[space]

The spaces between words were matched.
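The same two points can be checked in Python. This is only a sketch: the sentence is my own stand-in for the sample text, and re.IGNORECASE is Python's way of switching case sensitivity off.

```python
import re

text = "Highclere castle sits high on a hill"  # assumed sample text

exact = re.findall(r"high", text)                          # case sensitive
any_case = re.findall(r"high", text, flags=re.IGNORECASE)  # case ignored
spaces = re.findall(r" ", text)                            # a space is a character

print(exact)        # ['high']
print(any_case)     # ['High', 'high']
print(len(spaces))  # 6
```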

Metacharacters

Metacharacters are characters with a special meaning. They are central to regex. They help describe a repeating pattern in the text.

Tip

Metacharacters can have more than one meaning.

Tip

The meaning might vary between engines (this is the difference between languages or applications mentioned before).

Wildcard metacharacter

. (dot)

It matches any single character (in most engines, any character except a line break): one dot stands for one character. If you just put in a dot and apply it to the previous text, this is what happens:

.

Because the regex is applied globally, it matches every character separately, one by one.

Replacing one character with a dot is a good solution to deal with misspelled words.

r.n

Tip

Watch out for:
/9.00/
It will match 9.00, but also 9500 or 9T00.
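Here is a small Python sketch of that pitfall. The list of strings is assumed for illustration, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

prices = ["9.00", "9500", "9T00", "900"]  # assumed sample strings

# The unescaped dot accepts any single character in the second position.
loose = [p for p in prices if re.fullmatch(r"9.00", p)]
print(loose)  # ['9.00', '9500', '9T00']
```

"900" is left out because the dot must still match exactly one character.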

Escaping metacharacters

\ (backslash)

Escaping characters are used to remove the special meaning from metacharacters. In the following example the wildcard (. (dot)) is preceded by a backslash. It is no longer a wildcard and regex interprets it literally.

r\.n

Tip

Escape only metacharacters. If you escape literal characters you might give them a different meaning (e.g. /\t/ is a tab).

Tip

Quotation marks are not metacharacters (unlike in many languages)

Tip

Spaces are spaces and are considered literal characters.
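A short Python sketch of escaping, using made-up sample strings. re.escape is a standard-library helper that adds the backslashes for you; whether you need it depends on where your pattern text comes from.

```python
import re

text = "run ran r.n"  # assumed sample text

loose = re.findall(r"r.n", text)    # dot as a wildcard
strict = re.findall(r"r\.n", text)  # escaped dot: a literal full stop
print(loose)   # ['run', 'ran', 'r.n']
print(strict)  # ['r.n']

# Escaping a literal letter changes its meaning: \t is a tab, not the letter t.
tabs = re.findall(r"\t", "a\tb t")
print(tabs)  # ['\t']

# re.escape backslash-escapes the metacharacters in a string.
escaped = re.escape("r.n")
print(escaped)  # r\.n
```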

Character sets

In cases where we want to limit the range of signs to match, but include more than one sign, we can define a set.

[ is the beginning of the set
] is the end of the set

Regex will match any of the signs defined in the set in the position where the set is placed. For example, if we want to match “run” and “ran”, but no other three letter words starting with r and ending with n, we can do the following:

r[au]n

Tip

The character set still stands for a single position in the string (unless repeated).

To cover a range of signs, you can use a dash. For example, [0-9] matches any single digit from 0 to 9, and [a-zA-Z] creates a set that covers all lower-case and all upper-case letters.

r[a-z0-9]n
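Both patterns from this section can be tried in Python. The sample text below is assumed for illustration.

```python
import re

text = "run ran ren r2n rUn r-n"  # assumed sample text

# Only a or u is allowed in the middle position.
small_set = re.findall(r"r[au]n", text)
print(small_set)  # ['run', 'ran']

# Any lower-case letter or digit is allowed in the middle position.
ranges = re.findall(r"r[a-z0-9]n", text)
print(ranges)  # ['run', 'ran', 'ren', 'r2n']
```

"rUn" and "r-n" are skipped in both cases, because neither an upper-case U nor a dash belongs to the sets.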

Negative character sets

^ (caret)

Tip

This metacharacter has a few meanings (e.g. it is also an anchor).

The caret symbol at the start of a set is a negation that means: match any single character other than the characters following the caret. For example:

see[^nm]

Tip

Most metacharacters inside a set lose their special meaning and do not need to be escaped; they are treated literally. The caret is an exception: at the start of a set it means negation, so it has to be escaped if we want a literal match.

see[\^mn]

Because the caret was escaped, its negative meaning is gone too.
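The difference between the two patterns is easy to see side by side. This Python sketch uses an assumed sample text.

```python
import re

text = "seen seem seek see^ sees"  # assumed sample text

# [^nm] matches any one character except n or m.
negated = re.findall(r"see[^nm]", text)
print(negated)  # ['seek', 'see^', 'sees']

# With the caret escaped, the set is simply: ^, m or n.
literal = re.findall(r"see[\^mn]", text)
print(literal)  # ['seen', 'seem', 'see^']
```

Keep in mind that a negated set still has to match one character, and that character can be a space or a punctuation mark.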

Shorthand character sets (classes)

\d (digit)
\w (word character, incl digits and _, but not - )
\s (whitespace)
\D (not digit)
\W (not word character)
\S (not whitespace)

Each shorthand stands for any single character from its class.

For example \d represents any digit:

\d\d\d\d

This will match only four digits in a row, like “1988”.

And:

\w\w\w\w

This will match any four word characters in a row: four-letter words, four-digit numbers, or a mix of both.
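Here is a quick Python sketch of the two patterns above. The sample sentence is assumed for illustration.

```python
import re

text = "Born in 1988, code word: a_b1"  # assumed sample text

# Four digits in a row.
digits = re.findall(r"\d\d\d\d", text)
print(digits)  # ['1988']

# Four word characters in a row: letters, digits and underscores all count.
word_chars = re.findall(r"\w\w\w\w", text)
print(word_chars)  # ['Born', '1988', 'code', 'word', 'a_b1']
```

"in" is not matched by the second pattern because it has only two word characters before the next space.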

The metacharacters and special signs covered so far already open up a lot of possibilities. Most of the expressions that you will ever need could be written using only them, but the regex will not be perfect and it will be a very long string. To learn how to simplify your expressions, check out the next part!


About the Author

magda piatkowska I left uni as a systems engineer to take up DBA and later various BI positions at eircom in Dublin, Ireland. I then moved from telco to the gaming industry to join Silicon Valley's Zynga. I built there an international insights and analytics team. The team specialised in real time insights delivery and developing machine learning capabilities. We focused on text mining algorithms in order to include customer feedback in product development, segmentation and recommendations. I am currently with Channel4 where we are building a cutting edge data science team. Also, strongly supporting girls in rocking the world of technology!

Machine Learning and Analytics based in London, UK