Ikai Lan says

I say things!

Archive for April 2009

Using pattern matching with regular expressions in Scala


I’ve been trying to use Scala more and more so I can gain some experience and exposure to it. A couple of weeks ago, I wrote a Scala log parser for Ruby on Rails. It is terribly newbie-ish – the classes are mutable and it’s disorganized. It’s a mess. Jorge Ortiz from the Scala mailing list was kind enough to rewrite it in a more Scala style. It completely blew my mind how terse Scala can become when written correctly.

It bothered me, however, dealing with regular expressions the way that I did. The Java interface is pretty clumsy and nowhere near as clean as regular expression pattern extraction in Perl or Ruby.

As it turns out, it’s surprisingly easy to extract text using Regular Expressions in Scala. Throw away Pattern.compile! Check out this hotness below:

First, let’s import Scala’s regex package:

import scala.util.matching.Regex

Now we declare a regular expression to match against. We can do this one of two ways:

val LogEntry = new Regex("""Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""")

I use triple quotes here to signify that I am creating a raw string. A raw string means that I do not need to escape characters like the \ character. If I didn’t do this, I’d be forced to use strings like “\\d+”. Believe it or not, that extra backslash throws me off. Just goes to show that I have written way too many parsers.

Alternatively, I can declare a new Regex by doing this:

val LogEntry = """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r

Strings have a method called “r”, which converts the string to a Regex object. I’m not sold on this syntax at the moment, since it doesn’t play well with eyeball scans, but I’m putting it here for those folks that absolutely need to save characters.

There’s nothing really special here yet. The next step is REALLY cool:

val line = "Completed in 100ms (View: 25, DB: 75) | 200 OK [http://app.domain.com?params=here]"

scala> val LogEntry(totalTime, viewTime, dbTime, responseCode, uri) = line
totalTime: String = 100
viewTime: String = 25
dbTime: String = 75
responseCode: String = 200
uri: String =

The local variables totalTime, viewTime, dbTime, responseCode and uri are now bound to the values we want to extract from the original line! The regular expression value defines an unapplySeq method. I’m not quite good enough at Scala to tell you in any definite terms what that means, except that you can use the code in a pattern match:

line match {
  case LogEntry(totalTime, viewTime, dbTime, responseCode, uri) => {
    /* Process the data */
    // do something with totalTime.toInt
    // do something with viewTime.toInt
    // etc ...
  }
  case _ => // Do nothing
}
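A quick way to see what unapplySeq is doing is to call it by hand. This is only a sketch: the wrapper object name UnapplySeqSketch is made up for illustration, and it reuses the LogEntry regex and the sample line from above. For a String input, Regex.unapplySeq returns an Option holding the list of captured groups when the whole string matches, and None when it doesn't. That Option is what the pattern match consumes.

```scala
import scala.util.matching.Regex

// Hypothetical wrapper object, only here to make the snippet self-contained.
object UnapplySeqSketch {
  val LogEntry: Regex =
    """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r

  val line = "Completed in 100ms (View: 25, DB: 75) | 200 OK [http://app.domain.com?params=here]"

  // Calling the extractor directly: Some(list of groups) when the whole line matches.
  val groups: Option[List[String]] = LogEntry.unapplySeq(line)

  // A line that does not match the regex yields None, so a `case LogEntry(...)` would be skipped.
  val noMatch: Option[List[String]] = LogEntry.unapplySeq("not a log line")
}
```

Each name inside the parentheses of the pattern is bound to one element of that list, in order, which is why all five values come back as Strings.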

Because you can use a pattern match, and patterns will be matched in the order of definition, you can create several regular expressions representing the lines you want to extract, then process them easily using pattern matching.
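Here is a minimal sketch of that idea. The Completed regex is the one from this post; the Processing regex, the LineClassifier object and the describe function are assumptions made up for illustration, including the guess that a request line looks like "Processing UsersController#index ...".

```scala
import scala.util.matching.Regex

// Hypothetical helper: one regex per kind of log line, tried in order.
object LineClassifier {
  // The regex from the post.
  val Completed: Regex =
    """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r

  // Assumed format for a request line; adjust to whatever your log really prints.
  val Processing: Regex = """Processing (\w+)#(\w+) .*""".r

  def describe(line: String): String = line match {
    // Cases are tried top to bottom; the first regex that matches the whole line wins.
    case Completed(total, _, _, code, _) => s"completed in ${total}ms with status $code"
    case Processing(controller, action)  => s"processing $controller#$action"
    case _                               => "unrecognized"
  }
}
```

Adding support for another kind of line is then one more regex and one more case.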

Pretty powerful stuff. What would really make my day would be if someone knew how I could extract the values totalTime, viewTime, and dbTime as integers and not have to do a conversion – I’m already matching with \d+. Ideas?
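One possible answer, offered as a sketch and not the canonical way: a Regex extractor only ever hands back Strings, so the conversion has to happen somewhere, but it can be hidden inside a custom extractor object. The names TimedLogEntry, TimedLogEntryDemo and unaccountedMillis below are made up for illustration.

```scala
import scala.util.matching.Regex

// Hypothetical extractor that wraps the regex from the post and converts the timings.
object TimedLogEntry {
  private val Pattern: Regex =
    """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r

  // The three timings come back as Ints; the response code and URI stay Strings.
  // Caveat: \d+ also matches digit runs too long for an Int, in which case toInt throws.
  def unapply(line: String): Option[(Int, Int, Int, String, String)] = line match {
    case Pattern(total, view, db, code, uri) =>
      Some((total.toInt, view.toInt, db.toInt, code, uri))
    case _ => None
  }
}

object TimedLogEntryDemo {
  // The bound names are already Ints, so arithmetic works without any conversion here.
  def unaccountedMillis(line: String): Option[Int] = line match {
    case TimedLogEntry(total, view, db, _, _) => Some(total - view - db)
    case _                                    => None
  }
}
```

The conversion still exists, but it lives in one place and every call site gets to match on integers.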

Written by Ikai Lan

April 4, 2009 at 5:41 pm

Posted in Scala


Twitter, Ruby on Rails, Scala and people who don’t RTFA


The Register recently published an article titled, “Twitter jilts Ruby for Scala“, prompting a wave of Tweets (Twitter messages for the jargon challenged) about Ruby on Rails and scaling. More specifically, its lack of ability to do so. The source of this article was a talk given by the API lead at Twitter, Alex Payne, at the Web 2.0 Expo in San Francisco, where he discussed porting parts of Twitter from Ruby to Scala. Naturally, the armchair commentators added their two cents about the message, taking it to mean that Scala is *the* one true language for internet companies, and that Ruby on Rails should not be used because it cannot scale.

I tweeted about my annoyance with retweeters that don’t understand what they are retweeting, and Alex responded: “It’s really frustrating for me. I don’t think the people who were there for my talk got that impression. The press skewed it.”

Alex wasn’t saying that Rails can’t scale. Alex was saying that he really likes Scala, and parts of Twitter that were written in Ruby were rewritten in Scala because some tasks are simply not appropriate for Ruby. I know of Kestrel, their messaging queue, but likely there are other middleware layers that sit between their APIs/web interfaces and their data sources that are being ported.

Now, Scala is an amazing language. It compiles to JVM bytecode, and thus, most procedures will run near or at the speed of Java. Functions can be treated as first class objects and passed around like variables. Its system of type inference means less ceremony for simply declaring variables. Traits function remarkably like mixins in dynamic languages, allowing very rich composite classes that, in many cases, need to add very little boilerplate for the composition, unlike Java interfaces. And pattern matching with wildcards? I can’t even think of an analogy for how powerful of a tool this is. Is it the best compiled language? Some think so, and I can certainly see where they are coming from. I’m learning Scala as we speak. The basics were pretty easy to pick up, but I’m still struggling to be good with it. Scala best practices such as programming in an immutable style are far easier to preach than to actually do.
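For readers who haven't seen Scala, here is a small sketch of the features listed above. Everything in it (the ScalaFeatureTour object, the Greeter and Shouter traits, the describe function) is a made-up example and has nothing to do with Twitter's code.

```scala
// Hypothetical examples, one per feature mentioned above.
object ScalaFeatureTour {
  // Type inference: the compiler works out that this is a List[Int].
  val timings = List(100, 25, 75)

  // Functions are values: `double` can be stored and passed around like any variable.
  val double: Int => Int = _ * 2
  def applyToAll(xs: List[Int], f: Int => Int): List[Int] = xs.map(f)

  // Traits as mixins: Shouter is stacked onto a Person at instantiation time.
  trait Greeter {
    def name: String
    def greet: String = "Hello, " + name
  }
  trait Shouter extends Greeter {
    override def greet: String = super.greet.toUpperCase
  }
  class Person(val name: String) extends Greeter
  val loud = new Person("Scala") with Shouter

  // Pattern matching with wildcards: `_` stands for "anything goes here".
  def describe(entry: (String, Int)): String = entry match {
    case (_, 200)       => "ok"
    case ("/health", _) => "health check, whatever the status"
    case (path, code)   => path + " failed with " + code
  }
}
```

Note that none of the values above are reassigned after creation, which is the immutable style mentioned at the end of the paragraph.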

But Scala is not Ruby, and will never be Ruby. Ruby is probably the most powerful programming language on the planet for creating DSLs, or domain specific languages. It can do this because any class or object can be extended, any method overridden, any constant undefined and redefined, and so on, and so forth. You can take an Integer and declare more methods on it. Duck typing means that you don’t have to set up boilerplate interfaces and abstract classes to create objects that can replace other objects with similar interfaces. A web development platform as powerful and syntactically concise as Rails could only have been done in Ruby. Imagine this: a developer looks at a model file representing a database table, and immediately understands the business rules regarding validity as well as the callbacks that will take place when the object is created, updated or destroyed. Or this: it is possible to create a plugin that extends collection objects to allow very complex pagination, and a client developer only has to include a file or remove the include statement. This plugin recognizes if the collection in question maps to a database, and if so, applies LIMIT and OFFSET statements to SQL queries so as not to eagerly load too many objects at once. This plugin exists.

This power does not come without a price. Ruby is slow. Ruby is generally recognized as being twice as slow as Python, and at least an order of magnitude slower than C++ or Java. Charles Nutter, one of the creators of the JRuby VM for the Ruby language, recently posted an article about optimizations that could be made to the VM, and he improves Ruby performance by as much as twenty times by removing much of Ruby’s power and breaking compatibility. If anything, it shows us skeptics that there is no such thing as a free lunch.

So if Ruby is slow, and Rails uses Ruby, does that mean Rails cannot scale? Let me make this clear right now: Though Ruby is slow, this does not mean that Rails does not scale. Scalability is a very difficult concept, often oversimplified to be synonymous with efficiency. Efficiency is a facet of scalability, yes, but it is not its only dimension. Reliability and uptime are very important dimensions of a scalable system. So is horizontal scalability: the ability to serve an increase in usage with a corresponding increase in computational nodes. Though it may seem like a given that throwing hardware at a problem will increase capacity, this simply isn’t so. Anybody that has worked on a large data driven application will tell you that a centralized, authoritative source such as a database cluster is a bottleneck, and simply throwing hardware at the problem results in diminishing marginal returns. Then there is quality of service. A truly scalable architecture will have minimal service degradation as users and usage exponentially increase.

I use the word “architecture” because a truly scalable system that meets the requirements set by its intended consumer is rarely, if ever, about a single component. Even if we were just talking about speed and none of the other dimensions of scalability, there’s a great article by Joe Stump where he calls out language critics about this exact subject. C++ can probably assemble HTML at least a hundred times faster than Ruby, but this has minimal impact on what a user sees because disk reads are slow, talking to a data source is slow, and sending data over the internet is *really* slow. On top of all that, Steve Souders, creator of YSlow, a popular tool for benchmarking perceived speed of web sites, argues that for most sites, 90% of the wait time a user has to put up with is a result of loading assets, JavaScript and stylesheets. Rails and proper database constraints can ensure data integrity (reliability), check. Rails is by default stateless and sessions can be centralized (fault tolerance and horizontal scalability), check. Rails can scale, and does scale. Twitter is not replacing their web tier, because it is fast enough. Instead, they have focused on optimizing their middle tier, which sits between their web layer and their data layer. There’s no point in replacing the web layer: its performance and reliability are completely bound by the data and middleware layers.
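To put rough numbers on the speed argument, here is a back-of-the-envelope calculation. Every figure in it is invented for illustration, chosen only so that assets and network time make up 90% of the wait as in the Souders claim; none of it is a measurement of any real site.

```scala
// Hypothetical latency budget for one page view. All numbers are made up.
object LatencyBudget {
  val renderMs = 50.0              // assembling HTML in the "slow" language
  val dataMs = 150.0               // disk reads and talking to the data source
  val networkAndAssetsMs = 1800.0  // sending data over the internet, loading assets

  def total(render: Double): Double = render + dataMs + networkAndAssetsMs

  val before = total(renderMs)        // 2000.0 ms
  val after = total(renderMs / 100)   // rendering made a hundred times faster: 1950.5 ms

  // Fraction of the user's wait that the hundredfold speedup actually removes.
  val improvement = (before - after) / before  // roughly 0.025, i.e. about 2.5%
}
```

Under these assumed numbers, a hundredfold faster template layer shaves about 2.5% off what the user waits for, which is the point of the paragraph above.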

This is where Scala enters the picture: middleware. Likely this refers to a non-RDBMS datasource serving a denormalized social graph, a dispatching component for pushing messages through SMS and email, and a queue with queue workers. There are a few requirements here: concurrency and computational speed, both of which are areas where Ruby falls flat because MRI Ruby, the stock VM, uses green threads which block on I/O and do not make use of multicore processors. In addition, Ruby’s memory requirements are aggressive. Scala, on the other hand, is fast and is as concurrent as the JVM will allow, which is pretty damned concurrent.
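As a rough idea of what "a queue and queue workers" can look like on the JVM, here is a minimal sketch using the standard java.util.concurrent thread pool from Scala. It is not how Kestrel or anything at Twitter works; the WorkerPoolSketch object, the drain function and its handle parameter are all hypothetical.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch: a fixed pool of JVM threads draining a list of messages.
object WorkerPoolSketch {
  // Runs `handle` on every message using `workers` threads and returns how many were handled.
  def drain(messages: Seq[String], workers: Int)(handle: String => Unit): Int = {
    val pool = Executors.newFixedThreadPool(workers) // the pool keeps its own internal work queue
    val handled = new AtomicInteger(0)

    messages.foreach { msg =>
      pool.execute(new Runnable {
        def run(): Unit = {
          handle(msg) // a real worker might push msg out over SMS or email here
          handled.incrementAndGet()
        }
      })
    }

    pool.shutdown()                              // stop accepting new work
    pool.awaitTermination(10, TimeUnit.SECONDS)  // wait for queued work to finish
    handled.get()
  }
}
```

The pool's threads are ordinary JVM threads, so unlike green threads they can be scheduled across multiple cores.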

So what Twitter did was optimize the bottleneck and leave their investment in the front-end intact. This shouldn’t surprise anybody. Facebook started out on PHP, but now their backend is an amalgam of C/C++, Erlang and other languages. Google runs Java and C++ for their backend, but I’m told several web tier services are written in Python. And so forth. This doesn’t mean don’t use Python, and don’t use PHP. It just means to be ready to optimize and possibly to replace those components when the time is right, which, for many startups, is a LOOOOOOONG way off.

In fact, I still push Ruby on Rails as the development platform of choice for web startups launching a 1.0. Why? There are many reasons. Here are a few:

  • You will hit the ground running fast, and you will have something working within a single development cycle. If your application has any kind of boilerplate, like user management, pagination, multiformat output, simple AJAX or CRUD functionality, Rails just saved you from having to write most of it.
  • Onboarding developers is fast. I’ve been on projects where new developers have been able to be productive the first or second day of looking at a project’s structure and tests.
  • Rails emphasizes the importance of test-driven development, and any good Rails developer will feel wrong writing code without corresponding tests – when I started Scala, I settled on the NetBeans IDE because it seemed like the easiest to get going with JUnit.
  • Solving scalability problems too early is a bad idea. Nobody starts out with a sharded database. Instead of building features or attracting customers to a usable product, you spend your time building an incredibly complex system that will scale but at severe opportunity cost if it hits the market late. You want scalability problems, because it means you are growing too fast. In the words of The Wire’s Marlo Stanfield, “that sound like one of them good problems.”

– Ikai

Written by Ikai Lan

April 2, 2009 at 9:58 pm

Posted in Ruby on Rails, Scalability
