Ikai Lan says

I say things!

Using pattern matching with regular expressions in Scala


I’ve been trying to use Scala more and more so I can gain some experience and exposure to it. A couple of weeks ago, I wrote a Scala log parser for Ruby on Rails. It is terribly newbie-ish – the classes are mutable and it’s disorganized. It’s a mess. Jorge Ortiz from the Scala mailing list was kind enough to rewrite it in a more Scala style. It completely blew my mind how terse Scala can become when written correctly.

It bothered me, however, dealing with regular expressions the way that I did. The Java interface is pretty clumsy and nowhere near as clean as regular expression pattern extraction in Perl or Ruby.

As it turns out, it’s surprisingly easy to extract text using Regular Expressions in Scala. Throw away Pattern.compile! Check out this hotness below:

First, let’s import Scala’s regex package:

import scala.util.matching.Regex

Now we declare a regular expression to match against. We can do this one of two ways:

val LogEntry = new Regex("""Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""")

I use triple quotes here to signify that I am creating a raw string. A raw string means that I do not need to escape characters like the \ character. If I didn’t do this, I’d be forced to use strings like “\\d+”. Believe it or not, that extra slash throws me off. Just goes to show that I have written way too many parsers.

Alternatively, I can declare a new Regex by doing this:

val LogEntry = """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r

Strings have a method called “r”, which converts the string to a Regex object. I’m not sold on this syntax at the moment, since it doesn’t play well with eyeball scans, but I’m putting it here for those folks that absolutely need to save characters.
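
As a quick sanity check of the two points above, here is a minimal sketch. The object name and the shortened pattern are made up for illustration. It shows that a triple-quoted literal is the same string as its double-escaped twin, and that new Regex(...) and .r end up with the same underlying pattern:

```scala
import scala.util.matching.Regex

object RegexLiteralDemo {
  // Shortened, hypothetical pattern; not the full log pattern above.
  val raw = """(\d+)ms"""   // triple quotes: the backslash is kept as-is
  val escaped = "(\\d+)ms"  // ordinary string: the backslash must be doubled

  val viaConstructor: Regex = new Regex(raw)
  val viaR: Regex = raw.r

  def main(args: Array[String]): Unit = {
    println(raw == escaped)                                             // true
    println(viaConstructor.pattern.pattern() == viaR.pattern.pattern()) // true
    println(viaR.findFirstIn("Completed in 100ms"))                     // Some(100ms)
  }
}
```

So the choice between the two spellings looks like a matter of taste rather than behavior.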

There’s nothing really special here yet. The next step is REALLY cool:

val line = "Completed in 100ms (View: 25, DB: 75) | 200 OK [http://app.domain.com?params=here]"

scala> val LogEntry(totalTime, viewTime, dbTime, responseCode, uri) = line
totalTime: String = 100
viewTime: String = 25
dbTime: String = 75
responseCode: String = 200
uri: String =

The local variables totalTime, viewTime, dbTime, responseCode and uri are now bound to the values we want to extract from the original line! The regular expression value defines an unapplySeq method. I’m not quite good enough at Scala to tell you in any definite terms what that means, except that you can use the code in a pattern match:

line match {
case LogEntry(totalTime, viewTime, dbTime, responseCode, uri) => {
/* Process the data */
// do something with totalTime.toInt
// do something with viewTime.toInt
// etc ...
}
case _ => // Do nothing
}
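
As for what unapplySeq does, here is a rough sketch that calls it by hand. The pattern and the object name are invented for illustration and are much shorter than the log pattern above. As far as I can tell, the pattern syntax is sugar for this call: the whole input has to match, a match yields Some(list of groups), and anything else yields None.

```scala
import scala.util.Try
import scala.util.matching.Regex

object UnapplySeqDemo {
  // Hypothetical, shortened pattern used only for this sketch.
  val Timing: Regex = """(\d+)ms \(View: (\d+)\)""".r

  // Calling the extractor method directly.
  val hit: Option[List[String]] = Timing.unapplySeq("100ms (View: 25)")
  val miss: Option[List[String]] = Timing.unapplySeq("no timing here")

  // A val pattern has no fallback case, so a non-matching input throws
  // scala.MatchError. Try captures that here instead of crashing.
  val unsafe: Try[String] = Try {
    val Timing(total, view) = "no timing here"
    total
  }

  def main(args: Array[String]): Unit = {
    println(hit)              // Some(List(100, 25))
    println(miss)             // None
    println(unsafe.isFailure) // true
  }
}
```

This is one reason to prefer the match form with a case _ fallback whenever the input might not fit the pattern.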

Because you can use a pattern match, and patterns are tried in the order they are written, you can create several regular expressions representing the lines you want to extract, then process them easily using pattern matching.
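
A small sketch of that idea, with two made-up line formats (the regexes, the object name and the describe helper are all hypothetical, not taken from my parser):

```scala
object LogLineDemo {
  // Hypothetical line formats, invented for illustration.
  val Completed = """Completed in (\d+)ms""".r
  val Processing = """Processing (\w+)#(\w+)""".r

  // Cases are tried top to bottom; the first regex that matches the
  // whole line wins, and its groups are bound to the names given.
  def describe(line: String): String = line match {
    case Completed(ms)                  => s"completed in ${ms.toInt} ms"
    case Processing(controller, action) => s"request for $controller.$action"
    case _                              => "unrecognized"
  }

  def main(args: Array[String]): Unit = {
    println(describe("Completed in 100ms"))               // completed in 100 ms
    println(describe("Processing UsersController#index")) // request for UsersController.index
    println(describe("something else"))                   // unrecognized
  }
}
```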

Pretty powerful stuff. What would really make my day would be if someone knew how I could extract the values totalTime, viewTime, and dbTime as integers and not have to do a conversion – I’m already matching with \d+. Ideas?


Written by Ikai Lan

April 4, 2009 at 5:41 pm

Posted in Scala


11 Responses


  1. Hi, here is some info about the unapply magic: http://www.scala-lang.org/node/112

    I am puzzling over returning integers though…

    Channing Walton

    April 5, 2009 at 1:03 pm

  2. Cool! I had no idea that Regex let you pattern match, which is pretty awesome.

    Unfortunately, you can’t make Regex return Ints directly. Types are a compile-time thing, and your pattern can be any arbitrary String constructed at run-time, so in general there’s no way for the type-system to know what types of things you’ll be matching on. However, with a little compile-time help you can achieve what you want:

    val le = """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r
    val line = "Completed in 100ms (View: 25, DB: 75) | 200 OK [http://app.domain.com?params=here]"
    object LogEntry {
      def unapply(line: String): Option[(Int, Int, Int, String, String)] =
        for (List(ttime, vtime, dtime, rc, uri) <- le.unapplySeq(line))
          yield (ttime.toInt, vtime.toInt, dtime.toInt, rc, uri)
    }
    val LogEntry(totalTime, viewTime, dbTime, responseCode, uri) = line

    totalTime: Int = 100
    viewTime: Int = 25
    dbTime: Int = 75
    responseCode: String = 200
    uri: String =

    Jorge Ortiz

    April 5, 2009 at 4:11 pm

  3. Jorge, what’s the advantage of the for comprehension over something less ceremonial?

    object LogEntry {
      def unapply(line: String) = {
        val le(a,b,c,d,e) = line
        Some(a.toInt, b.toInt, c.toInt, d, e)
      }
    }

    jherber

    April 5, 2009 at 6:54 pm

  4. You could do something like:

    object I {
      def unapply(s: String): Option[Int] =
        try { Some(s.toInt) }
        catch { case _: NumberFormatException => None }
    }

    val LogEntry(I(totalTime), I(viewTime), I(dbTime), I(responseCode), uri) = line

    totalTime: Int = 100
    viewTime: Int = 25
    dbTime: Int = 75
    responseCode: Int = 200
    uri: String =

    Mark Harrah

    April 8, 2009 at 6:52 am

  5. [...] matching extrahieren bsp [...]

  6. If (\d+) returns something that throws NumberFormatException, I think you’d want to crow about it pretty loudly.

    jpitts

    January 25, 2010 at 12:39 pm

  7. The syntax you have mentioned is not available in the recent versions of Scala.

    See:

    http://www.scala-lang.org/node/122

    http://fileit.in/p/7

    Subhash Chandran

    February 26, 2010 at 4:11 am

  8. Who is “you”, Subhash? Which code doesn’t work?

    scala> val alnu = """foo ([a-zA-Z]+) ([0-9]+)""".r
    alnu: scala.util.matching.Regex = foo ([a-zA-Z]+) ([0-9]+)

    scala> val alnu (str, num) = "foo bar 1234"
    str: String = bar
    num: String = 1234

    works for me (2.8 final).

    (a small hint on the website, how to format code, bold, cite, italic would be nice)

    I don’t understand the syntax. Normally, in the REPL, if I write

    val a = …
    val a = …

    the second a is hiding the first a – it’s a new declaration, and I may write

    val a = 7
    val a = "seven"

    and the second line behaves as if the first line had never been there.

    Stefan W.

    August 7, 2010 at 8:05 am

  9. I think it would be important to point out that in """(?m) blab""" the (?m) flag makes the regex multiline and, more importantly, that """(?s).*mom.*""" activates dotall matching (i.e., "." also matches newlines), which is very useful and hard to find in the docs and examples.

    Olin Atkinson

    June 24, 2011 at 7:07 am

  10. syntax highlighting please

    jerm

    October 25, 2011 at 7:44 am

  11. [...] Conveniently, the Regex class can be combined with Scala pattern-matching machinery to directly bind captured groups to local variables in one shot: [...]
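
For anyone curious about the inline flags from comment 9, here is a minimal sketch of what they change. The sample text and the object name are invented; the flags themselves come from java.util.regex, which Scala's Regex wraps.

```scala
object InlineFlagsDemo {
  // Hypothetical three-line input.
  val text = "hi\nmom\nbye"

  // Without (?s), "." stops at newlines, so the whole string cannot match.
  val plainMatches: Boolean = """.*mom.*""".r.pattern.matcher(text).matches()
  // With (?s) (dotall), "." also matches newlines.
  val dotAllMatches: Boolean = """(?s).*mom.*""".r.pattern.matcher(text).matches()

  // Without (?m), ^ and $ only anchor at the start and end of the input.
  val wholeInputOnly: List[String] = """^\w+$""".r.findAllIn(text).toList
  // With (?m) (multiline), ^ and $ also anchor at each line boundary.
  val perLine: List[String] = """(?m)^\w+$""".r.findAllIn(text).toList

  def main(args: Array[String]): Unit = {
    println(plainMatches)   // false
    println(dotAllMatches)  // true
    println(wholeInputOnly) // List()
    println(perLine)        // List(hi, mom, bye)
  }
}
```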

