Ikai Lan says

I say things!

JRuby In-Memory Search Example With Lucene 3.0.1

with 2 comments

Just for giggles I decided to port the In-Memory search example from my last blog post to JRuby. It’s been some time since I’ve used JRuby for anything, but the team has still been hard at work making strides towards better Java interoperability and ease of use. I downloaded JRuby 1.5.0_RC1, pointed my PATH to the /bin directory, and began hacking.

I’m incredibly impressed with the level of Java interop and startup speed improvements. Kudos to the JRuby team. Integrating Java couldn’t have been easier.

The example is below. Run it with the command:


jruby -r /path/to/lucene-core-3.0.1.jar inmemory.rb

require 'java'
# Either require the JAR file directly by uncommenting the require line
# below, or pass the -r flag to JRuby as follows:
# jruby -r /path/to/lucene-core-3.0.1.jar inmemory.rb
# require "lucene-core-3.0.1.jar"

java_import org.apache.lucene.analysis.standard.StandardAnalyzer
java_import org.apache.lucene.document.Document
java_import org.apache.lucene.document.Field
java_import org.apache.lucene.index.IndexWriter
java_import org.apache.lucene.queryParser.ParseException
java_import org.apache.lucene.queryParser.QueryParser
java_import org.apache.lucene.store.RAMDirectory
java_import org.apache.lucene.util.Version

java_import org.apache.lucene.search.IndexSearcher
java_import org.apache.lucene.search.TopScoreDocCollector


def create_document(title, content)
  doc = Document.new
  doc.add Field.new("title", title, Field::Store::YES, Field::Index::NO)
  doc.add Field.new("content", content, Field::Store::YES, Field::Index::ANALYZED)  
  doc
end

def create_index
  idx     = RAMDirectory.new
  writer  = IndexWriter.new(idx, StandardAnalyzer.new(Version::LUCENE_30), IndexWriter::MaxFieldLength::LIMITED)

  writer.add_document(create_document("Theodore Roosevelt",
          "It behooves every man to remember that the work of the " +
                  "critic, is of altogether secondary importance, and that, " +
                  "in the end, progress is accomplished by the man who does " +
                  "things."))
  writer.add_document(create_document("Friedrich Hayek",
          "The case for individual freedom rests largely on the " +
                  "recognition of the inevitable and universal ignorance " +
                  "of all of us concerning a great many of the factors on " +
                  "which the achievements of our ends and welfare depend."))
  writer.add_document(create_document("Ayn Rand",
          "There is nothing to take a man's freedom away from " +
                  "him, save other men. To be free, a man must be free " +
                  "of his brothers."))
  writer.add_document(create_document("Mohandas Gandhi",
          "Freedom is not worth having if it does not connote " +
                  "freedom to err."))

  writer.optimize
  writer.close
  idx
end

def search(searcher, query_string)
  parser = QueryParser.new(Version::LUCENE_30, "content", StandardAnalyzer.new(Version::LUCENE_30))
  query = parser.parse(query_string)
  
  hits_per_page = 10
  
  collector = TopScoreDocCollector.create(5 * hits_per_page, false)
  searcher.search(query, collector)
  
  # Notice how this differs from the Java version: JRuby automagically translates
  # underscore_case_methods into camelCaseMethods, but scoreDocs is not a method:
  # it's a field. That's why we have to use camelCase here, otherwise JRuby would
  # complain that score_docs is an undefined method.
  hits = collector.top_docs.scoreDocs
  
  hit_count = collector.get_total_hits
    
  if hit_count.zero?
    puts "No matching documents."
  else
    puts "%d total matching documents" % hit_count
    
    puts "Hits for %s were found in quotes by:" % query_string
    
    hits.each_with_index do |score_doc, i|
      doc_id = score_doc.doc
      doc_score = score_doc.score
      
      puts "doc_id: %s \t score: %s" % [doc_id, doc_score]
      
      doc = searcher.doc(doc_id)
      # each_with_index counts from 0; add 1 so the numbering matches the Java version
      puts "%d. %s" % [i + 1, doc.get("title")]
      puts "Content: %s" % doc.get("content")
      puts
      
    end
    
  end

end

def main
  index = create_index
  searcher = IndexSearcher.new(index)

  search(searcher, "freedom")
  search(searcher, "free")
  search(searcher, "progress or achievements")
  search(searcher, "ikaisays.com")

  searcher.close
end

main()

Written by Ikai Lan

April 25, 2010 at 7:49 pm

Posted in JRuby, Ruby, Software Development


Lucene In-Memory Search Example: Now updated for Lucene 3.0.1

with 3 comments

Update: Here’s a link to some sample code for Python using PyLucene. Thanks, Joseph!

While playing around with Lucene in my experiments to make it work with Google App Engine, I found an excellent example for indexing some text using Lucene in-memory; unfortunately, it dates back to May 2004 (!!!). I’ve updated the example to work with the newest version of Lucene, 3.0.1. It’s below for reference.

The Pastie link for the code snippet can be found here.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import java.io.IOException;

public class LuceneTest{

   public static void main(String[] args) {
      // Construct a RAMDirectory to hold the in-memory representation
      // of the index.
      RAMDirectory idx = new RAMDirectory();

      try {
         // Make a writer to create the index
         IndexWriter writer =
                 new IndexWriter(idx, 
                         new StandardAnalyzer(Version.LUCENE_30), 
                         IndexWriter.MaxFieldLength.LIMITED);

         // Add some Document objects containing quotes
         writer.addDocument(createDocument("Theodore Roosevelt",
                 "It behooves every man to remember that the work of the " +
                         "critic, is of altogether secondary importance, and that, " +
                         "in the end, progress is accomplished by the man who does " +
                         "things."));
         writer.addDocument(createDocument("Friedrich Hayek",
                 "The case for individual freedom rests largely on the " +
                         "recognition of the inevitable and universal ignorance " +
                         "of all of us concerning a great many of the factors on " +
                         "which the achievements of our ends and welfare depend."));
         writer.addDocument(createDocument("Ayn Rand",
                 "There is nothing to take a man's freedom away from " +
                         "him, save other men. To be free, a man must be free " +
                         "of his brothers."));
         writer.addDocument(createDocument("Mohandas Gandhi",
                 "Freedom is not worth having if it does not connote " +
                         "freedom to err."));

         // Optimize and close the writer to finish building the index
         writer.optimize();
         writer.close();

         // Build an IndexSearcher using the in-memory index
         Searcher searcher = new IndexSearcher(idx);

         // Run some queries
         search(searcher, "freedom");
         search(searcher, "free");
         search(searcher, "progress or achievements");

         searcher.close();
      }
      catch (IOException ioe) {
         // In this example we aren't really doing any I/O, so this
         // exception should never actually be thrown.
         ioe.printStackTrace();
      }
      catch (ParseException pe) {
         pe.printStackTrace();
      }
   }

   /**
    * Make a Document object with an un-indexed title field and an
    * indexed content field.
    */
   private static Document createDocument(String title, String content) {
      Document doc = new Document();

      // Add the title as an unindexed field...

      doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));


      // ...and the content as an indexed field. The content is passed
      // in as a plain String, run through the analyzer
      // (Field.Index.ANALYZED) and also stored verbatim
      // (Field.Store.YES) so that it can be printed with the results.
      doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));

      return doc;
   }

   /**
    * Searches for the given string in the "content" field
    */
   private static void search(Searcher searcher, String queryString)
           throws ParseException, IOException {

      // Build a Query object
      QueryParser parser = new QueryParser(Version.LUCENE_30, 
              "content", 
              new StandardAnalyzer(Version.LUCENE_30));
      Query query = parser.parse(queryString);


      int hitsPerPage = 10;
      // Search for the query
      TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, false);
      searcher.search(query, collector);

      ScoreDoc[] hits = collector.topDocs().scoreDocs;

      int hitCount = collector.getTotalHits();
      System.out.println(hitCount + " total matching documents");

      // Check the hit count to see if there were any matches

      if (hitCount == 0) {
         System.out.println(
                 "No matches were found for \"" + queryString + "\"");
      } else {
         System.out.println("Hits for \"" +
                 queryString + "\" were found in quotes by:");

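         // Iterate over the collected ScoreDocs. The collector keeps at
         // most 5 * hitsPerPage of them, which can be fewer than hitCount,
         // so loop over the array length rather than the total hit count.
         for (int i = 0; i < hits.length; i++) {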
            ScoreDoc scoreDoc = hits[i];
            int docId = scoreDoc.doc;
            float docScore = scoreDoc.score;
            System.out.println("docId: " + docId + "\t" + "docScore: " + docScore);

            Document doc = searcher.doc(docId);

            // Print the value that we stored in the "title" field. Note
            // that this Field was not indexed, but it was stored
            // verbatim (as was the "content" field), so it can be
            // retrieved.
            System.out.println("  " + (i + 1) + ". " + doc.get("title"));
            System.out.println("Content: " + doc.get("content"));            
         }
      }
      System.out.println();
   }
}

In progress: still trying to figure out how to get some version of Lucene working on App Engine for Java. My thoughts:

  • Use an In Memory index
  • Serialize to Memcache or the Datastore (not even sure how to do this right now)

Granted, there are limitations to this: if an App Engine application exceeds some memory limit, a SoftMemoryExceeded exception will be thrown. Also, I’m doubtful that indexes can be updated incrementally in the datastore, not to mention that there’s a 1 MB limit on datastore entries. The Blobstore, accessed programmatically, may not have the latency required. Still, it’s an interesting thought experiment, and there’s probably some compromise we can find with a future feature of App Engine that’ll allow us to make Lucene actually usable. We just have to think of it. Stay tuned. I’ll write another post if I can get even a proof-of-concept to work.

Written by Ikai Lan

April 24, 2010 at 8:32 am

How I use social media


I’m often asked how I have time to use all the services I do such as Twitter, Facebook, LinkedIn, and so forth. I just do – it really doesn’t take up that much of my time, and I find that each of these services provides unique value to me, both professionally and socially. No, this isn’t a “thou shalt” type of article. Rather, it’s a breakdown of how I use the various social media outlets.

Facebook

I recently did a massive purge of my friends list. I had somewhere around 700 friends, and I sheared that down to a little over 200. I was adding folks with abandon for a while – if I met you once face to face, I clicked “Accept” when that ever so exciting “New Friend Request” notification fell into my inbox. No more. I’ve sheared it down to people I actually talk to, who talk to me, or who somehow interact with my Facebook page. Facebook is where I go to post pictures of myself wearing a fake mustache, or in a giant banana suit trying to negotiate a pose with a guy dressed up as Wolverine, or a video of my rendition of the Whitney Houston classic, “I Have Nothing.” Facebook is where I go to be stupid. Every once in a while, I’ll receive a business inquiry or friend request from someone I’ve met at a conference or in a professional context online. I’ll promptly decline.

LinkedIn

A quick disclaimer: I used to work at LinkedIn. While I was there, I loved working there and I loved the vision of the site. My entire engineering and design team that I worked with has since left, so I can’t attest to how it still is there (though there are still some cool people I know and talk to now and then).

LinkedIn is where I maintain a profile and look folks up several times a week. I used to check LinkedIn on my mobile device once a week at least to look for updates to folks I’ve worked with or had meaningful professional exchanges with, but in recent months the Twitter integration has made it too damned chatty and this feature isn’t useful for me anymore (I’ve expressed how I feel about this in a recent Tweet). For me, LinkedIn is for people search.

Twitter

Twitter has replaced LinkedIn as my weapon of choice for keeping up with my professional contacts. Nowadays, when I give a presentation or meet someone in the technical field, chances are, they have a Twitter account. I’ll tell them to follow me, and in most cases, I’ll reciprocate. Twitter has the entire ecosystem for me to use my tools on my desktop (Tweetdeck, Seesmic web) and on my mobile device (Tweetie 2 for iPhone, Seesmic for Android) to exchange quick messages and share interesting content. I don’t follow celebrities, and in most cases, I don’t follow my friends since many of them do not use Twitter professionally and cross post to both Facebook and Twitter anyway.

I’ve also met lots of interesting developers via Twitter. I have search columns in Tweetdeck and in my Seesmic Web session for terms I care about: scala, jruby, ruby, python, app engine, java, clojure. When I see someone tweeting interesting content, I will follow them and try to engage them. Most Twitter users are open to meetings at conferences or coffee/beer when they or you come to town. This is an incredibly unique feature of Twitter that allows me to find professionals by what they are currently working on or interested in that I can’t currently get anywhere else. If you work in technology, get on Twitter and start connecting.

Gowalla

I prefer Gowalla to Foursquare in spite of all the hype for the latter. Why? Gowalla has people I actually know adding me. Too many folks on Foursquare are people following me on Twitter that I have never met (very few of the folks I have met on Twitter have added me on Foursquare). There’s an argument here that it helps them find me. I don’t buy it. We’ll meet on terms we agree on, not when you find out I am near you and decide to pop in (the internet is still full of crazies). Gowalla has a nicer app for both iPhone and Android, both visually and because it actually figures out where I am within the top three or so results. On Foursquare, the place I am at is buried too deep, and if it doesn’t exist, it’s too hard to add it – you have to add an address, city, zipcode, etc. On Gowalla this is a two-step process.

So how useful is Gowalla to me? Marginal. If it disappeared tomorrow, nothing would change in my life. This is in stark contrast to Facebook, Twitter and LinkedIn. Gowalla’s primary value proposition to me is checking in some place cool and posting quickly to Facebook so my friends can make silly comments.

Google Buzz

Disclaimer #2: I work for Google now. We were using Buzz internally for months before it launched, and it is super useful to be able to look at all the things members of your team are doing. This is what Yammer has been trying to accomplish. I’ve heard from distributed teams that use Yammer that it facilitates greater teamwork and collaboration. I definitely feel this way about internal Buzz.

The public Buzz is pretty cool, but hasn’t caught on among as many of my friends as I would have liked. I’ve unwired my Twitter connections because I already see them on Twitter. Buzz has a better discussion mechanism than Facebook, LinkedIn, Twitter – basically all of those, because it bubbles recently commented items to the top. I use Buzz pretty frequently to discuss local sports and video games with a circle of about ten people. I pity the folks who follow me and don’t care about these topics.

Yelp

When I first created my Yelp account in 2006 or so, I wrote about 60 reviews in the first week. It was loads of fun to meet other local folks! Nowadays, there are so many reviews that I feel like reviews don’t matter anymore, and I haven’t written in ages. To fill that need, I’ve started guest blogging for The Culture Bite, a food blog run by a friend of mine. I still use Yelp to find reviews for restaurants, though I take them with a grain of salt because folks will downrate a place for different reasons. Some may downrate for mandatory 18% gratuity (lots of places do this, people), loud patrons, a poor parking situation, restricted hours, and so forth. People have different ideas of what should be considered part of the dining experience, and this makes me weight the rating system less.

Written by Ikai Lan

March 25, 2010 at 8:38 pm

Posted in Uncategorized

GoRuCo 2009 impressions


I passed on RailsConf this year, mostly for scheduling reasons. Instead, I attended the Gotham Ruby Conference (GoRuCo) this past Saturday at Pace University in NYC. Overall, I was very impressed with the caliber of attendees and would go again.

While all of the talks were outstanding, there were a few that I felt really stood out:

Where is Ruby headed?

Gregory Brown, creator of the Prawn PDF generation library for Ruby, funded completely by the community.

This was the first talk of the day. I didn’t sleep particularly well the night before and was a bit worried that I wouldn’t be ready to get anything out of the talk, but I’m happy to say that I was wrong. Gregory addressed several of the issues he had with making sure that Prawn would be compatible with both 1.8.6 and 1.9.1. He addressed various topics from very specific code samples of 1.8.6 and 1.9.1 differences to various workarounds for code that needs to be compatible with both versions (I/O iterators … #each is now #each_line). Eventually, this led to a few interesting discussions:

  • The Ruby versioning system is EXTREMELY confusing. 1.8.7 is not a *minor* update, but a rather significant one, and 1.9.1 is way more mature than 1.9
  • The general consensus is to not try to support 1.8.7, and to focus on 1.8.6 and 1.9.1. Eric Hodel, maintainer of RubyGems, will be moving towards only supporting 1.9.1+ in the coming months, but it’ll be a slow process.
  • The best way to include external C libraries in Ruby on Windows is FFI with JRuby. It’s a mad, mad, mad world.
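Going back to the #each / #each_line difference mentioned above, here is a minimal sketch of that kind of workaround. It assumes the difference in question is String#each, which 1.8 offered and 1.9 dropped, while String#each_line exists on both. The helper name non_blank_lines is made up for this example and is not code from Gregory's talk.

```ruby
# Hypothetical helper (not from the talk): collects the non-blank lines
# of a String. It only uses String#each_line, which is available on both
# 1.8.6 and 1.9.1, instead of String#each, which 1.9 removed.
def non_blank_lines(text)
  lines = []
  text.each_line do |line|
    stripped = line.chomp
    lines << stripped unless stripped.empty?
  end
  lines
end

p non_blank_lines("first\n\nsecond\nthird\n")
```

Sticking to the method name that both versions share avoids sprinkling version checks through the code.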

SOLID Object Oriented Design

Sandi Metz

The slides are here: http://skmetz.home.mindspring.com/img1.html

I thought this was an amazing talk. Sandi discussed the methodology behind refactoring Object Oriented code. In this case, she discussed the design of a Ruby FTP client that would download a CSV file and do some trivial computation.

However, I felt that Sandi’s example is purposely hypothetical and meant to illustrate refactoring techniques. Realistically, for something as simple as an FTP client, the example was an exercise in over-engineering. There’s a balance between extensibility/modularity and creating a simple to use interface. It’s clear that Sandi is leaning towards the former, as it got to a point where she was evaluating a class name from a YAML file (RED FLAG, RED FLAG). I disagree with this approach, since Ruby can be so terse and powerful that in many cases, code should be configuration, especially if the configuration is for specifying a class name for the purposes of dependency injection. This is not the Ruby way, and there were comments that we had essentially turned the FTP client into something that was all about configuration and not convention, and it eventually leads to death by XML. Use judgment. Sandi’s techniques are great for refactoring complex interactions, but as Rubyists we understand that not everything is an object. Make the code readable for humans, and make the APIs clean. I don’t know if I agree with Sandi’s philosophy that “since you don’t know how your code will be used, it should be as modular as possible”. That sounds to me like front-loading engineering. I’m going to quote Reid Hoffman: “If you are not embarrassed by the first version of your product, you’ve launched too late.” Be pragmatic.
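To make the contrast concrete, here is a minimal sketch of the two styles. None of this is Sandi's code: FtpDownloader, FakeDownloader, Report and the YAML key are all names invented for illustration.

```ruby
require 'yaml'

# Hypothetical collaborators, invented for this sketch.
class FtpDownloader
  def fetch(path)
    "ftp contents of #{path}"
  end
end

class FakeDownloader
  def fetch(path)
    "canned contents of #{path}"
  end
end

# Style 1: configuration names the class, and code resolves it at runtime.
config = YAML.load("downloader: FtpDownloader")
from_yaml = Object.const_get(config["downloader"]).new

# Style 2: code is the configuration. The default collaborator sits in
# the method signature, and callers can still inject a different one.
class Report
  def initialize(downloader = FtpDownloader.new)
    @downloader = downloader
  end

  def body(path)
    @downloader.fetch(path)
  end
end

puts from_yaml.fetch("sales.csv")
puts Report.new.body("sales.csv")
puts Report.new(FakeDownloader.new).body("sales.csv")
```

Both styles allow swapping the downloader, but in the second one the default is visible right where it is used, with no extra file to keep in sync.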

From Rails to Rack: Making Rails 3 a Better Ruby Citizen

Yehuda Katz, Project Lead for Merb, Engine Yard

Yehuda discussed a few of the major refactorings of Rails taking place for version 3. It’s all the type of stuff you hear that is great about Merb: Rack-ability (ability to mount Rack apps inside Rails), ORM agnosticism and view agnosticism. What has me really excited, however, is the approach of unobtrusive JavaScript for RJS helpers. When I became proficient at JavaScript, I stopped using RJS not because it didn’t give me enough options, but because I found that inlining JavaScript had a tendency to get in the way.

So did I miss anything? Please comment if I did!

Written by Ikai Lan

June 2, 2009 at 11:15 am

Posted in Uncategorized

Toeing the line: My take on the GoGaRuCo presentation fiasco

with one comment

No, I’m not going to further whip Matt Aimonetti about his poorly chosen theme for his talk about CouchDB at GoGaRuCo this year. I think that enough has been said about the topic by far more influential members of the community than myself (there are many more links). Matt Aimonetti has already apologized, and I believe that he is genuinely apologetic. I will disagree, however, with Sarah Mei’s assertion that Rails is still a ghetto. If anything, the fact that we have been able to discuss this topic in such depth shows that the Ruby diaspora is not a group of antisocial hackers. We’re not a ghetto; we’re a real community that cares not only for the quality of our craft, but for being responsible craftsmen that care about having a positive impact on society through our passion and professionalism.

What I haven’t seen a lot of, however, is how we can benefit from what has already happened. We can’t go back and stuff the proverbial cat back in the bag. What’s done is done, and in order to move on we need to learn from our mistakes. I can’t rationalize why Matt Aimonetti did what he did better than he can, but I can tell an embarrassing story about myself making a very similar mistake. “All this has happened before, and all this will happen again,” I believe the adage goes:

My sophomore year in college, I took a job working in Housing IT to pay the bills. I was majoring in Computer Science and had to start somewhere; prior to the job I had worked doing data entry and filing clerk type duties. It was a huge step up for me. I was 19, pretty good with computers, and overall excited to finally have a real tech job.

For whatever reason, at the time, SJSU Housing IT’s leadership, in all their wisdom, decided that every student that would be using the residential network (ResNet) needed to have their NIC’s MAC address registered and a static IP configured. I suppose there were liability reasons, or the rules were created by someone with no knowledge of how modern networks function, or whatever. The point is that it was a particularly arduous job to go around to every student’s computer and, no matter how old or how messed up their copies of Windows 98 and Windows ME were, to get them up and running with internet access. Housing IT hired 5 of us students to serve as PARCs, short for Peer Advisors for Residential Computing, to deal with the management of this process, and when the initial configuration was done, to slip into more Helpdesk like positions. We had these gigantic binders of printouts of MAC addresses, their assigned Ethernet port number, and their IP address, and the first few weeks of school, I spent my time buried in Windows TCP/IP configuration hell (random note: a semester after I left, one ambitious student set up DHCP … and I said to him, “You idiot, you just put all the remaining PARCs out of work” in jest).

The challenge we had was information propagation and telling students the process for signing up for an internet configuration appointment. You have to remember what life was like before the age of ubiquitous Web 2.0 for college students. You got information out by posting fliers that no one read and making announcements that no one listened to. And of course, we couldn’t post information on the website or email students, because, well, they didn’t have an internet connection.

So being 19, ambitious and a bit of a maverick, I took it upon myself to start a bit of what resembled a viral marketing campaign. I took pictures of all the different PARCs and created posters that would have what I thought were funny taglines, a picture of a PARC, and a short description of the process of signing up for internet. The taglines were pretty random. I don’t remember most of them, but they were along these lines:

GOT MILK?
BE MY FRIEND!
DO YOU LIKE MY HAIR?

In my mind, it was pretty hilarious, since housing seemed to make it a point to take terribly unflattering pictures of us wearing branded shirts in the only size they could afford: XL. Ah, public school.

I saved the edgy one for myself, since, hey, I’m an edgy guy. I had a grin from cheek to cheek and blinked during the photo. Here’s what I put above my face:

DOES THIS LOOK LIKE A CHILD ABDUCTOR?

It was funny, right? Right? I mean, I laughed when I was making it.

Imagine my surprise when the excrement hit the fan. I had raised a furor, and it came All The Way Down The Mountain from the desk of the president of housing through several layers of management, and finally my boss, who called everyone on my team to ask who the culprit was. I quickly fessed up, and even though I was a little bit confused – it was only a joke – I was summoned to his office, where he asked me about the fliers, holding one in hand.

“What part of you thought this was a good idea?” he asked. He was usually a pretty jovial guy. He was not smiling.

I didn’t know, and I told him. But I didn’t try to defend why I did it. In fact, at this point, part of me still thought I was in trouble for unofficially posting fliers that hadn’t been signed off (and rightly so) and not because of the flier with my face on it subtly implying that I was some kind of a sex offender.

“No,” my boss said to me. “These other ones are fine. You’re a smart guy, what the hell could compel you to make a flier like this? Don’t you realize that many of these parents are first time college parents, and this is the first time their children are away from home?” He paused and looked me in the eye because he wanted the next point to really stick, “Don’t you realize that we might fire you?”

Wow. I slunk down in my chair. My first real day working my first job I cared about, and I was on the verge of losing it. Looking back, I sometimes wonder if my shame came more from the fact that I almost lost my job, or more from the fact that I could do something so boneheaded and not realize how people would react. Luckily, I didn’t lose my job, though I spent a lot of time hunting down fliers and apologizing to people. And configuring &%&*@* Windows ME TCP/IP settings, of course.

In retrospect, I’m glad the whole mess took place, because I learned a lot about being edgy, and the seminar we had prior to student check-in about respecting a diversity of opinions really sank in. I haven’t forgotten those lessons since, and I don’t think I ever will. I learned that even though I am, myself, someone who likes to make jokes and be edgy and push the envelope, not everyone else is. The same traits that can make me interesting and fun to be around also have the potential to make me obnoxious, offensive, and to create a bad perception of myself in the eyes of people who don’t really know me yet. The easiest lesson to extract here is to simply err on the side of caution when communicating with people who don’t know you, but I’d like to think it goes much deeper than that. I don’t think we give ourselves enough credit for just being decent people on a day-to-day basis, because interacting with other humans is really challenging, and we’ll often get it wrong. Even though we can usually be safe by avoiding sex, politics and religion, we still run the risk of offending people. I didn’t realize my “child abduction” tagline would lead to parental types with torches and pitchforks calling for my head. I stayed in a hostel on a recent overseas trip and the manager confided in me that he was there because he spent a few years in a maximum security penitentiary and was deported; a few days later I made some joke about a jail cell to him before I realized how stupid that was and walked away, savoring the taste of my foot. Human beings are notoriously difficult creatures, and that’s why there’s a premium on respecting others. It’s not easy to do. We can be R-rated individuals. I’m one. But that doesn’t give us the right to be assholes.

I can see how and why Matt Aimonetti ended up with the presentation he did, because I’ve made the same mistake. Believe me, everyone, he feels like crap. Thanks to the interwebs, it’s a thousand times worse for him than it was for me. A few months ago, I submitted a talk about social media, and the conference organizer emailed me back giving me some advice about one of the screenshots of Facebook I had taken. There was a swear word on one of the images. I honestly didn’t even notice it and I quickly removed it because it added nothing to my talk. He didn’t ask me to remove it, he just suggested that in previous years, some people complained about swearing. I wouldn’t have been one of those people, but also, I wouldn’t have been one of the people complaining about people complaining about swearing. As a community, we have matured to the point where we are starting to set boundaries. I can understand the Ruby community’s aversion to fences; Matz has always placed the importance of happiness over that of arbitrary rules. These rules, however, aren’t arbitrary. In fact, it’s through these boundaries that we open up our community to outsiders and become inclusive of people who want to create great applications, use fantastic tools, and work with awesome people. In a sense, I’m glad someone has set off this fire, because a lot of things that needed to be said have been said, and as a result, I’m absolutely positive the Ruby community will become better for it. I’m glad someone has pushed the boundaries a bit too far, and as a result, we’ve all stepped back and said, “No, we can’t.” And lastly, I know it’s selfish: I’m glad that someone wasn’t me.

– Ikai

Written by Ikai Lan

May 4, 2009 at 11:52 pm

Posted in Uncategorized

Using pattern matching with regular expressions in Scala

with 11 comments

I’ve been trying to use Scala more and more so I can gain some experience and exposure to it. A couple of weeks ago, I wrote a Scala log parser for Ruby on Rails. It is terribly newbie-ish – the classes are mutable and it’s disorganized. It’s a mess. Jorge Ortiz from the Scala mailing list was kind enough to rewrite it in a more Scala style. It completely blew my mind how terse Scala can become when written correctly.

It bothered me, however, dealing with regular expressions the way that I did. The Java interface is pretty clumsy and nowhere near as clean as regular expression pattern extraction in Perl or Ruby.

As it turns out, it’s surprisingly easy to extract text using Regular Expressions in Scala. Throw away Pattern.compile! Check out this hotness below:

First, let’s import Scala’s regex package:

import scala.util.matching.Regex

Now we declare a regular expression to match against. We can do this one of two ways:

val LogEntry = new Regex("""Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""")

I use triple quotes here to signify that I am creating a raw string. A raw string means that I do not need to escape characters like the \ character. If I didn’t do this, I’d be forced to use strings like “\\d+”. Believe it or not, that extra slash throws me off. Just goes to show that I have written way too many parsers.

Alternatively, I can declare a new Regex by doing this:

val LogEntry = """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r

Strings have a method called “r”, which converts the string to a Regex object. I’m not sold on this syntax at the moment, since it doesn’t play well with eyeball scans, but I’m putting it here for those folks that absolutely need to save characters.

There’s nothing really special here yet. The next step is REALLY cool:

val line = "Completed in 100ms (View: 25, DB: 75) | 200 OK [http://app.domain.com?params=here]"

scala> val LogEntry(totalTime, viewTime, dbTime, responseCode, uri) = line
totalTime: String = 100
viewTime: String = 25
dbTime: String = 75
responseCode: String = 200
uri: String =

The local variables totalTime, viewTime, dbTime, responseCode and uri are now bound to the values we want to extract from the original line! The regular expression value defines an unapplySeq method. I’m not quite good enough at Scala to tell you in any definite terms what that means, except that you can use the code in a pattern match:

line match {
case LogEntry(totalTime, viewTime, dbTime, responseCode, uri) => {
/* Process the data */
// do something with totalTime.toInt
// do something with viewTime.toInt
// etc ...
}
case _ => // Do nothing
}
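
For the curious, here is a rough sketch of what unapplySeq is doing. KeyValue is a hypothetical hand-written extractor, not something from the Scala library. The idea is that a pattern like KeyValue(k, v) calls unapplySeq on the object and, if the result is a Some holding exactly two elements, binds them to k and v. A Regex value behaves the same way, handing back its capture groups.

```scala
// Hypothetical extractor for illustration: splits "key=value" strings by hand.
object KeyValue {
  // Some(parts) means "the pattern matched, here are the pieces"; None means no match.
  def unapplySeq(s: String): Option[Seq[String]] = {
    val parts = s.split("=", -1)
    if (parts.length == 2) Some(parts.toSeq) else None
  }
}

def describe(s: String): String = s match {
  case KeyValue(k, v) => "key " + k + " has value " + v
  case _              => "no match"
}
```

Calling describe("a=b") takes the first case, while describe("nope") falls through to the wildcard.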

Because you can use a pattern match, and patterns are matched in the order of definition, you can create several regular expressions representing the lines you want to extract, then process them easily using pattern matching.
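
A minimal sketch of that idea follows. LogEntry is the regex from earlier in this post; Processing and the line format it matches are assumptions made up for illustration (loosely modeled on a Rails request log line), as is the classify function.

```scala
val LogEntry = """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r

// Hypothetical second pattern, for illustration only.
val Processing = """Processing (\w+)#(\w+) .*""".r

// Cases are tried top to bottom; the first regex that matches the whole line wins.
def classify(line: String): String = line match {
  case LogEntry(total, _, _, code, _) => "completed in " + total + "ms with status " + code
  case Processing(controller, action) => "processing " + controller + "#" + action
  case _                              => "unrecognized"
}
```

Note that a Regex used as a pattern has to match the entire string, which is why both expressions end in a catch-all.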

Pretty powerful stuff. What would really make my day would be if someone knew how I could extract the values totalTime, viewTime, and dbTime as integers and not have to do a conversion – I’m already matching with \d+. Ideas?
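
One possible approach, offered as a sketch rather than a definitive answer: extractors can be nested, so a small helper can do the conversion inside the pattern itself. AsInt and totals are hypothetical names introduced here. The conversion still happens, it just moves out of the case body.

```scala
import scala.util.Try

// Hypothetical helper: matches only when the captured string parses as an Int.
object AsInt {
  def unapply(s: String): Option[Int] = Try(s.toInt).toOption
}

val LogEntry = """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r

// totalTime, viewTime and dbTime are bound as Int, not String.
def totals(line: String): Option[(Int, Int, Int)] = line match {
  case LogEntry(AsInt(totalTime), AsInt(viewTime), AsInt(dbTime), _, _) =>
    Some((totalTime, viewTime, dbTime))
  case _ => None
}
```

A side effect of this design is that a number too large for an Int makes the case fail to match instead of throwing.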

Written by Ikai Lan

April 4, 2009 at 5:41 pm

Posted in Scala

Twitter, Ruby on Rails, Scala and people who don’t RTFA

with 34 comments

The Register recently published an article titled, “Twitter jilts Ruby for Scala”, prompting a wave of Tweets (Twitter messages for the jargon challenged) about Ruby on Rails and scaling. More specifically, its lack of ability to do so. The source of this article was a talk given by the API lead at Twitter, Alex Payne, at the Web 2.0 Expo in San Francisco, where he discussed porting parts of Twitter from Ruby to Scala. Naturally, the armchair commentators added their two cents about the message, taking it to mean that Scala is *the* one true language for internet companies, and that Ruby on Rails should not be used because it cannot scale.

I tweeted about my annoyance with retweeters that don’t understand what they are retweeting, and Alex responded: “It’s really frustrating for me. I don’t think the people who were there for my talk got that impression. The press skewed it.”

Alex wasn’t saying that Rails can’t scale. Alex was saying that he really likes Scala, and parts of Twitter that were written in Ruby were rewritten in Scala because some tasks are simply not appropriate for Ruby. I know of Kestrel, their messaging queue, but likely there are other middleware layers that sit between their APIs/web interfaces and their data sources that are being ported.

Now, Scala is an amazing language. It compiles to JVM bytecode, and thus, most procedures will run near or at the speed of Java. Functions can be treated as first class objects and passed around like variables. Its system of type inference means less ceremony for simply declaring variables. Traits function remarkably like mixins in dynamic languages, allowing very rich composite classes that, in many cases, need to add very little boilerplate for the composition, unlike Java interfaces. And pattern matching with wildcards? I can’t even think of an analogy for how powerful of a tool this is. Is it the best compiled language? Some think so, and I can certainly see where they are coming from. I’m learning Scala as we speak. The basics were pretty easy to pick up, but I’m still struggling to be good with it. Scala best practices such as programming in an immutable style are far easier to preach than to actually do.
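
To make those features a little more concrete, here is a small sketch. Every name in it (Greeter, Loud, Person, twice, addThree, kind) is made up for illustration and has nothing to do with Twitter’s code.

```scala
// Traits as mixins: Loud layers behavior on top of Greeter.
trait Greeter {
  def name: String
  def greet: String = "Hello, " + name
}

trait Loud extends Greeter {
  override def greet: String = super.greet.toUpperCase + "!"
}

class Person(val name: String) extends Greeter

// Functions are values: twice takes a function and applies it two times.
def twice(f: Int => Int, x: Int): Int = f(f(x))

// Type inference: addThree is inferred to be Int => Int without an annotation.
val addThree = (n: Int) => n + 3

// Pattern matching with wildcards.
def kind(x: Any): String = x match {
  case 0      => "zero"
  case n: Int => "int " + n
  case (a, _) => "pair starting with " + a
  case _      => "something else"
}

// The trait is mixed in at instantiation time, with no extra class declaration.
val shouty = new Person("World") with Loud
```

Here shouty.greet goes through Loud first and then Greeter, and twice(addThree, 1) applies the function value twice.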

But Scala is not Ruby, and will never be Ruby. Ruby is probably the most powerful programming language on the planet for creating DSLs, or domain specific languages. It can do this because any class or object can be extended, any method overridden, any constant undefined and redefined, and so on, and so forth. You can take an Integer and declare more methods on it. Duck typing means that you don’t have to set up boilerplate interfaces and abstract classes to create objects that can replace other objects with similar interfaces. A web development platform as powerful and syntactically concise as Rails could only have been done in Ruby. Imagine this: a developer looks at a model file representing a database table, and immediately understands the business rules regarding validity as well as the callbacks that will take place when the object is created, updated or destroyed. Or this: it is possible to create a plugin that extends collection objects to allow very complex pagination, and a client developer only has to include a file or remove the include statement. This plugin recognizes if the collection in question maps to a database, and if so, applies LIMIT and OFFSET clauses to SQL queries so as not to eagerly load too many objects at once. This plugin exists.

This power does not come without a price. Ruby is slow. Ruby is generally recognized as being twice as slow as Python, and at least an order of magnitude slower than C++ or Java. Charles Nutter, one of the creators of the JRuby VM for the Ruby language, recently posted an article about optimizations that could be made to the VM, and he improves Ruby performance by as much as twenty times by removing much of Ruby’s power and breaking compatibility. If anything, it shows us skeptics that there is no such thing as a free lunch.

So if Ruby is slow, and Rails uses Ruby, does that mean Rails cannot scale? Let me make this clear right now: Though Ruby is slow, this does not mean that Rails does not scale. Scalability is a very difficult concept, often oversimplified to be synonymous with efficiency. Efficiency is a facet of scalability, yes, but it is not its only dimension. Reliability and uptime are very important dimensions of a scalable system. So is horizontal scalability: the ability to serve an increase in usage with a corresponding increase in computational nodes. Though it may seem like a given that throwing hardware at a problem will increase capacity, this simply isn’t so. Anybody that has worked on a large data driven application will tell you that a centralized, authoritative source such as a database cluster is a bottleneck, and simply throwing hardware at the problem results in diminishing marginal returns. Then there is quality of service. A truly scalable architecture will have minimal service degradation as users and usage exponentially increase.

I use the word “architecture” because a truly scalable system that meets the requirements set by its intended consumer is rarely, if ever, about a single component. Even if we were just talking about speed and none of the other dimensions of scalability, there’s a great article by Joe Stump where he calls out language critics about this exact subject. C++ can probably assemble HTML at least a hundred times faster than Ruby, but this has minimal impact on what a user sees because disk reads are slow, talking to a data source is slow, and sending data over the internet is *really* slow. On top of all that, Steve Souders, creator of YSlow, a popular tool for benchmarking perceived speed of web sites, argues that for most sites, 90% of the wait time a user has to put up with is a result of loading assets, JavaScript and stylesheets. Rails and proper database constraints can ensure data integrity (reliability), check. Rails is by default stateless and sessions can be centralized (fault tolerance and horizontal scalability), check. Rails can scale, and does scale. Twitter is not replacing their web tier because it is fast enough; instead, they have focused on optimizing their middle tier, which sits between their web layer and their data layer. There’s no point in replacing the web layer: its performance and reliability are completely bound by the data and middleware layers.

This is where Scala enters the picture: middleware. Likely this refers to a non-RDBMS datasource serving a denormalized social graph, a dispatching component for pushing messages through SMS and email, and a queue and queue workers. There are a few requirements here: concurrency and computational speed, both of which are areas where Ruby falls flat because MRI Ruby, the stock VM, uses green threads which block on I/O and do not make use of multicore processors. In addition, Ruby’s memory requirements are aggressive. Scala, on the other hand, is fast and is as concurrent as the JVM will allow, which is pretty damned concurrent.

So what Twitter did was optimize the bottleneck and leave their investment in the front-end intact. This shouldn’t surprise anybody. Facebook started out on PHP, but now their backend is an amalgam of C/C++, Erlang and other languages. Google runs Java and C++ for their backend, but I’m told several web tier services are written in Python. And so forth. This doesn’t mean don’t use Python, and don’t use PHP. It just means to be ready to optimize and possibly to replace those components when the time is right, which, for many startups, is a LOOOOOOONG way off.

In fact, I still push Ruby on Rails as the development platform of choice for web startups launching a 1.0. Why? There are many reasons. Here are a few:

  • You will hit the ground running fast, and you will have something working within a single development cycle. If your application has any kind of boilerplate, like user management, pagination, multiformat output, simple AJAX or CRUD functionality, Rails just saved you from having to write most of it.
  • Onboarding developers is fast. I’ve been on projects where new developers have been able to be productive the first or second day of looking at a project’s structure and tests.
  • Rails emphasizes the importance of test-driven development, and any good Rails developer will feel wrong writing code without corresponding tests – when I started Scala, I settled on the NetBeans IDE because it seemed like the easiest to get going with JUnit.
  • Solving scalability problems too early is a bad idea. Nobody starts out with a sharded database. Instead of building features or attracting customers to a usable product, you spend your time building an incredibly complex system that will scale but at severe opportunity cost if it hits the market late. You want scalability problems, because it means you are growing too fast. In the words of The Wire’s Marlo Stanfield, “that sound like one of them good problems.”

– Ikai

Written by Ikai Lan

April 2, 2009 at 9:58 pm

Posted in Ruby on Rails, Scalability

First impressions of Lift Scala web development framework from a Ruby on Rails developer

with 8 comments

Over the past few months I’ve been hearing a lot about Scala and, in general, have been very interested. Scala, short for “scalable language” (and why I will continue to pronounce it “skay-lah” rather than “skah-lah”), is a strongly typed JVM language that combines aspects of functional programming and dynamic languages with static typing and the JVM to provide a language that has some of the flexibility of languages such as Ruby or Haskell, but with the performance and interoperability of Java.

One of the projects leading the charge in Scala is Lift, a web development framework. I’ve been developing in Ruby on Rails for the past few years, and really, I don’t need to learn another framework. I took a very close look at Django and even built a few projects, but stuck with Rails as my primary tool of choice. Lift interests me for the following reasons:

  • can run in any Java Application Server
  • can run inline Java

Why are these important? I’ve met a lot of consultants who recommended a solution that required deploying mod_php/Erlang/WSGI/Mongrel and got the project shot down. But switch the pitch to Java running in an application server? Happy client. I’ve been pushing JRuby hard on every other Rails developer I meet, so these requirements are not high on my list, though they add a lot of points. Also, it is worth mentioning that many, many Java-based frameworks such as Grails or the recently open-sourced AribaWeb can do these.

  • high performance* (have not seen with my own eyes)
  • uses Scala
  • out of the box Comet support

Comet is a way of simulating push applications over HTTP using a browser, which is completely a client pull type of application. Comet is also known as long polling, and it is how every single JavaScript browser chat application works (Meebo, Facebook chat, etc). Basically, the client JavaScript opens an XMLHttpRequest (XHR for short) to the server which the server does not respond to until there is data to be pushed (hence the name “long polling”). This gets rid of the need for clients to poll the server at intervals, which has two problems: at longer intervals, data is not pushed as quickly, which would make an IM application unusable; at shorter intervals, browsers would quickly saturate their network connection as well as the server. Comet isn’t without its problems. Most Java applications, for instance, use a “one thread per request” model for each open connection, which does not scale efficiently as the threads would be the limiting factor in the number of clients that could connect at once even though most of the time the threads would be idle. In fact, open connections were such a problem that Facebook’s chat implementation is completely written in Erlang, a concurrent language that can create millions of lightweight processes and was really the only implementation that could scale efficiently to their needs. They blog about it here: http://www.facebook.com/note.php?note_id=14218138919
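
The core of long polling can be sketched without any HTTP at all. The following is an in-process simulation only, assuming a hypothetical Mailbox class standing in for the server side; it is not how Lift or any real Comet server is implemented. The “request” blocks until data shows up or a timeout expires, which is also why each waiting client ties up a thread in the thread-per-request model.

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// Hypothetical stand-in for the server side of a long poll.
final class Mailbox {
  private val queue = new LinkedBlockingQueue[String]()

  // Called when the server has something to push.
  def push(msg: String): Unit = queue.put(msg)

  // The "long poll": hold the caller until a message arrives or the timeout expires.
  def longPoll(timeoutMs: Long): Option[String] =
    Option(queue.poll(timeoutMs, TimeUnit.MILLISECONDS))
}

def demo(): (Option[String], Option[String]) = {
  val mailbox = new Mailbox
  // Simulate data becoming available a little later, on another thread.
  val producer = new Thread(new Runnable {
    def run(): Unit = { Thread.sleep(100); mailbox.push("new chat message") }
  })
  producer.start()
  val first  = mailbox.longPoll(2000) // wakes up as soon as the message is pushed
  val second = mailbox.longPoll(50)   // nothing is queued, so this one times out
  (first, second)
}
```

In a real deployment the client would immediately open a new request after each response or timeout.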

The way Lift deals with Comet scaling issues is by making use of Jetty Continuations. The thread-per-request model is still used; however, threads are suspended when they are not needed, resulting in a much more efficient use of resources.

It’s these reasons that make Lift appealing to me to learn. However, after fiddling with Lift, there are a few things off the top of my head that I already don’t like much about Lift or are so different they threaten to make my brain reboot:

  • Servers are not stateless. If you need to horizontally scale, your load balancer needs to read the JSESSIONID parameter in the HTTP request and direct traffic based on that information. I’ve been told that Lift is so incredibly high performance this isn’t necessary. This doesn’t answer the question of hot failover, and frankly, I was a bit disappointed.
  • Everything depends on state! This is probably WHY Lift can be so high performance. Most web frameworks deal with requests as they come in, looking up the same data per request to reinstantiate session objects, User objects, or any other objects that need to persist longer than a web request. It’s such a different way of thinking about problems and web development that experienced web developers coming into Lift from other frameworks will have to come in with a blank slate. Lift encourages abstracting away the request/response cycle. It remains to be seen whether this is a good thing or a bad thing.
  • Unintuitive way to add new pages. You have to add to a sitemap in what might be the most unintuitive manner possible:
    val entries = Menu(Loc("Home", List("index"), "Home")) :: Menu(Loc("Test", List("test"), "Test")) :: User.sitemap

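    The :: operator prepends an element to a list (concatenating two lists is :::), so this expression puts the two Menu entries in front of User.sitemap. This will make all your pages appear in the site menu. To make pages that don’t appear in the site menu?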

    Menu(Loc("MyHiddenPage", List("hidden"), "hidden", Hidden))

    To me this seems unintuitive, but then again, I don’t actually understand what is happening here, as the API docs are unusable. I haven’t figured out routing yet, for instance. When I was learning Django, I figured out how to set up routes to functions in minutes.

  • Scala shorthand. I’ve mentioned this and all the functional programmers have screamed bloody murder. Code is for human beings, not cyborgs. I understand that it’s clever to save keystrokes:
    list.sort(_ < _)

    As opposed to, say, Ruby:

    list.sort { |a, b| a <=> b }

    But these are trivial examples. There’s code all over the place that looks like this:

    fun1(a, b _).call(_).fun2(_ <= _)

    I’m sorry, but that’s not very welcoming. The worst part is that Scala HAS verbose syntax. You can use underscore notation, or you can use:

    () => something
    (x) => x.something

    I was at a job interview once where I was asked to write code to solve a problem using any language I wanted. I wrote a monstrous Ruby one-liner that was fun to write but that I would never, ever write in real life if I expected other developers to read my code. Sometimes there really is such a thing as too clever.

  • Dearth of working tutorials or documentation* (I plan on writing tutorials for as long as I am learning or interested in Lift)
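
Two of the Scala points in the list above, the :: operator and the underscore shorthand, can be sketched in plain Scala with no Lift involved. This assumes a current Scala release, where the sorting method is called sortWith; all the value names are made up for illustration.

```scala
val tail = List("c", "d")

// :: prepends a single element. It is right-associative,
// so this reads as "a" :: ("b" :: tail).
val consed = "a" :: "b" :: tail

// ::: concatenates two whole lists.
val joined = List("a", "b") ::: tail

val nums = List(3, 1, 2)

// The placeholder shorthand and the explicit function literal mean the same thing.
val short   = nums.sortWith(_ < _)
val verbose = nums.sortWith((a, b) => a < b)
```

Both consed and joined end up as the same four-element list, and short and verbose are the same sorted list, so the choice between the two spellings is purely about readability.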

In spite of these things I still think Lift has an interesting approach to many of the problems of web development. Rather than judge Lift based on my initial impressions, I’ll get into it some more before I decide if it’s a technology that I’d push. Ruby on Rails was one such technology, and in spite of all of its problems I still view it as an amazing development platform. The difference here is that by the time I started learning Rails, several books had already been written, and there were plenty of tutorials on the web.

One of the problems with Lift is that most of the tutorials I have seen so far are written by actual developers of Lift. As a developer, it’s hard to figure out what people need to know. There’s a tendency to say to new developers coming on, “You only need to know A, B and C.” Then, thirty minutes later, when the new guy is completely lost, “Oh, and D! Sorry!” As a newbie, trust me, I won’t miss D. It’ll be in my face such that I’ll get mad, stop coding, then complain on Twitter, on the mailing list and in this blog.

As for the immediate future, I’ll continue writing tutorials for Lift, beginning with a tutorial coming soon about how to get started developing on Lift with NetBeans. Stay tuned.

Written by Ikai Lan

March 3, 2009 at 8:39 pm

Posted in Lift, Scala

First post, WordPress support rocks

with one comment

I had an issue with the domain, and within 15 minutes, not one, but TWO folks at Automattic responded to my request and resolved the issue. Thanks guys! You guys are awesome. Hope you guys are always going to respond like this!

Ikai

Written by Ikai Lan

December 27, 2008 at 12:03 pm

Posted in Uncategorized