Using pattern matching with regular expressions in Scala
I’ve been trying to use Scala more and more so I can gain some experience and exposure to it. A couple of weeks ago, I wrote a Scala log parser for Ruby on Rails. It is terribly newbie-ish – the classes are mutable and it’s disorganized. It’s a mess. Jorge Ortiz from the Scala mailing list was kind enough to rewrite it in a more Scala style. It completely blew my mind how terse Scala can become when written correctly.
It bothered me, however, dealing with regular expressions the way that I did. The Java interface is pretty clumsy and nowhere near as clean as regular expression pattern extraction in Perl or Ruby.
As it turns out, it’s surprisingly easy to extract text using Regular Expressions in Scala. Throw away Pattern.compile! Check out this hotness below:
First, let’s import Scala’s Regex class:
import scala.util.matching.Regex
Now we declare a regular expression to match against. We can do this one of two ways:
val LogEntry = new Regex("""Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""")
I use triple quotes here to create a raw string. In a raw string I do not need to escape characters like the \ character. If I didn't do this, I'd be forced to use strings like "\\d+". Believe it or not, that extra backslash throws me off. Just goes to show that I have written way too many parsers.
Alternatively, I can declare a new Regex by doing this:
val LogEntry = """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r
Strings have a method called "r", which converts the string to a Regex object. I'm not sold on this syntax at the moment, since it doesn't play well with eyeball scans, but I'm putting it here for those folks who absolutely need to save characters.
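As a quick sanity check, here is a small sketch showing that the triple-quoted and escaped spellings produce the same string, and that new Regex(...) and .r compile the same pattern text. The RawStringDemo object and the value names in it are made up for illustration, and the pattern is shortened to keep it readable.

```scala
import scala.util.matching.Regex

object RawStringDemo {
  // Triple-quoted string: backslashes are taken literally.
  val raw = """(\d+)ms"""
  // Ordinary string literal: each backslash has to be escaped.
  val escaped = "(\\d+)ms"

  // Two ways to build a Regex from the same pattern text.
  val viaConstructor: Regex = new Regex(raw)
  val viaR: Regex = escaped.r

  def main(args: Array[String]): Unit = {
    // Both literals are the same string.
    println(raw == escaped)
    // Both Regex values wrap the same pattern text.
    println(viaConstructor.pattern.pattern() == viaR.pattern.pattern())
  }
}
```

Which spelling to use is mostly a matter of taste; the resulting Regex behaves the same either way.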
There's nothing really special here yet. The next step is REALLY cool:
val line = "Completed in 100ms (View: 25, DB: 75) | 200 OK [http://app.domain.com?params=here]"
scala> val LogEntry(totalTime, viewTime, dbTime, responseCode, uri) = line
totalTime: String = 100
viewTime: String = 25
dbTime: String = 75
responseCode: String = 200
uri: String =
The local variables totalTime, viewTime, dbTime, responseCode and uri are now bound to the values we want to extract from the original line! (uri comes back empty because, in this sample line, nothing sits between the domain and the ? character.) The regular expression value defines an unapplySeq method. I’m not quite good enough at Scala to tell you in any definite terms what that means, except that you can use the code in a pattern match:
line match {
case LogEntry(totalTime, viewTime, dbTime, responseCode, uri) => {
/* Process the data */
// do something with totalTime.toInt
// do something with viewTime.toInt
// etc ...
}
case _ => // Do nothing
}
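For anyone curious what unapplySeq is doing underneath, it can be called directly. The sketch below assumes a reasonably recent Scala release, where Regex.unapplySeq takes the input text and returns an Option holding the captured groups, and where the whole input has to match the pattern. The shortened pattern and the names UnapplySeqDemo, Timing, matching and other are hypothetical.

```scala
import scala.util.matching.Regex

object UnapplySeqDemo {
  // Shortened, hypothetical pattern so the example stays readable.
  val Timing: Regex = """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\)""".r

  val matching = "Completed in 100ms (View: 25, DB: 75)"
  val other    = "Rendering template"

  // On success: Some(list of captured groups, in order).
  val hit: Option[List[String]] = Timing.unapplySeq(matching)
  // On failure: None, which is what lets a `case` fall through to the next one.
  val miss: Option[List[String]] = Timing.unapplySeq(other)

  def main(args: Array[String]): Unit = {
    println(hit)
    println(miss)
  }
}
```

Roughly speaking, a pattern such as case Timing(a, b, c) asks the compiler to call this method and bind the list elements to a, b and c when the result is a Some.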
Because you can use a pattern match, and patterns are matched in the order of definition, you can create several regular expressions representing the lines you want to extract, then process them easily using pattern matching.
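A minimal sketch of that idea follows. The two patterns and the names LogLines, Completed, Processing and describe are assumptions invented for the example, not formats a real Rails log is guaranteed to use.

```scala
import scala.util.matching.Regex

object LogLines {
  // Hypothetical line formats, one Regex per kind of line.
  val Completed: Regex  = """Completed in (\d+)ms .*""".r
  val Processing: Regex = """Processing (\w+)#(\w+) .*""".r

  // Cases are tried top to bottom; the first regex that matches wins.
  def describe(line: String): String = line match {
    case Completed(ms)                  => "completed in " + ms.toInt + " ms"
    case Processing(controller, action) => "processing " + controller + "." + action
    case _                              => "ignored"
  }

  def main(args: Array[String]): Unit = {
    println(describe("Completed in 100ms (View: 25, DB: 75)"))
    println(describe("Processing UsersController#index (for 127.0.0.1)"))
    println(describe("something else"))
  }
}
```

Putting the most specific patterns first keeps a looser pattern from swallowing lines that a later case was meant to handle.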
Pretty powerful stuff. What would really make my day would be if someone knew how I could extract the values totalTime, viewTime, and dbTime as integers and not have to do a conversion – I’m already matching with \d+. Ideas?

Hi, here is some info about the unapply magic: http://www.scala-lang.org/node/112
I am puzzling over returning integers though…
Channing Walton
April 5, 2009 at 1:03 pm
Cool! I had no idea that Regex let you pattern match, which is pretty awesome.
Unfortunately, you can’t make Regex return Ints directly. Types are a compile-time thing, and your pattern can be any arbitrary String constructed at run-time, so in general there’s no way for the type-system to know what types of things you’ll be matching on. However, with a little compile-time help you can achieve what you want:
val le = """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r
val line = "Completed in 100ms (View: 25, DB: 75) | 200 OK [http://app.domain.com?params=here]"
object LogEntry {
def unapply(line: String): Option[(Int, Int, Int, String, String)] =
for (List(ttime, vtime, dtime, rc, uri) <- le.unapplySeq(line))
yield (ttime.toInt, vtime.toInt, dtime.toInt, rc, uri)
}
val LogEntry(totalTime, viewTime, dbTime, responseCode, uri) = line
totalTime: Int = 100
viewTime: Int = 25
dbTime: Int = 75
responseCode: String = 200
uri: String =
Jorge Ortiz
April 5, 2009 at 4:11 pm
Jorge, what’s the advantage of the for comprehension over something less ceremonial?
object LogEntry {
def unapply(line: String) = {
val le(a,b,c,d,e) = line
Some(a.toInt, b.toInt, c.toInt, d, e)
}
}
jherber
April 5, 2009 at 6:54 pm
You could do something like:
object I {
def unapply(s: String): Option[Int] =
try { Some(s.toInt) }
catch { case _: NumberFormatException => None }
}
val LogEntry(I(totalTime), I(viewTime), I(dbTime), I(responseCode), uri) = line
totalTime: Int = 100
viewTime: Int = 25
dbTime: Int = 75
responseCode: Int = 200
uri: String =
Mark Harrah
April 8, 2009 at 6:52 am
If (\d+) returns something that throws NumberFormatException, I think you’d want to crow about it pretty loudly.
jpitts
January 25, 2010 at 12:39 pm