Ikai Lan says

Archive for the ‘Software Development’ Category

Introduction to working with App Engine’s low-level datastore API

with 9 comments

App Engine’s Java SDK ships with three different mechanisms for persisting data:

  • JPA – the javax.persistence.* package
  • JDO – Java Data Objects
  • The low-level API

The official documentation has some good examples for working with JDO and JPA, but the documentation for working with the low-level API is still a tad sparse. The original purpose of the low-level API was to give developers a foundation for building their own persistence libraries – alternative persistence mechanisms such as Objectify, Twig, SimpleDS and Slim3 all build on top of this API.

For most developers, it may be simpler to use either JDO, JPA or a third-party library, but there are cases in which the low-level API is useful. This post will be a beginner’s guide to writing and retrieving data using this API – we’ll save more advanced topics for future posts.

For those newer to App Engine, let’s define a few terms before we continue:

Entity – An entity is an object representation of a datastore row. Unlike a row in a relational database, there are no predefined columns. There’s only one giant Bigtable, and your entities are all part of that table.

Entity Kind – There are no tables corresponding to types of data. The Kind of the entity is stored as part of the Entity Key.

Entity Key – The primary way by which entities are fetched – even when you issue queries, the datastore does a batch get by key of entities. It’s similar to a primary key in an RDBMS. The Key encodes your application ID, your Entity’s Kind, any parent entities and other metadata. A full description of the Key is out of scope for this article, but you’ll be able to find plenty of content about Keys with your favorite search engine.

Properties – Entities don’t have columns – they have properties. A property represents a field of your Entity. For instance, a Person Entity would have a Kind of Person, a Key built from a unique identifier such as their name (in the real world a name is rarely unique; it only works for me because I’m the only Ikai Lan in the world), and Properties: age, height, weight, eye color, etc.

There are a lot more terms, but these are the ones we’ll be using frequently in this article. Let’s describe a few key features of the low-level API which differ from using a higher level tool such as the JDO and JPA interfaces. Depending on your point of view, these could be either advantages or disadvantages:

  • Typeless entities. Think of an Entity as a Java class with a Key property (datastore Key), a Kind property (String) and Properties (HashMap of Properties). This means that for a given entity kind, it is possible to have two different entities with completely different properties. You could have a Person entity that defines age and weight as its properties, and a separate Person entity that defines height and eye color.
  • No Managers. You instantiate a DatastoreService from a DatastoreServiceFactory, then you get(), put() and query()*. No need to worry about opening or closing managers, detaching objects, marking items dirty, and so forth.
  • Lower startup time. For lower traffic Java websites, loading a PersistenceManagerFactory or EntityManagerFactory can incur additional startup time cost.

We’ll cover queries in a future post. In this post, we’ll just use get() and put(), and we’ll treat App Engine’s datastore as if it were just a giant Map. Frankly, this isn’t a bad simplification – at its lowest level, Bigtable is a key-value store, which means that the Map abstraction isn’t too far from reality.
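To make the “giant Map” picture concrete, here is a minimal plain-Java sketch that uses no App Engine classes at all. ToyEntity and ToyDatastore are hypothetical names invented for this illustration; they only mimic the shape of the low-level API (a kind, a key name and a bag of untyped properties) and say nothing about how the real datastore is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an Entity: a kind, a key name and untyped properties.
final class ToyEntity {
    final String kind;
    final String keyName;
    final Map<String, Object> properties = new HashMap<>();

    ToyEntity(String kind, String keyName) {
        this.kind = kind;
        this.keyName = keyName;
    }

    void setProperty(String name, Object value) {
        properties.put(name, value);
    }

    // Returns null when the property was never set.
    Object getProperty(String name) {
        return properties.get(name);
    }
}

// Hypothetical stand-in for the datastore: one big map shared by every kind.
final class ToyDatastore {
    private final Map<String, ToyEntity> rows = new HashMap<>();

    // The map key combines kind and key name, loosely like a real Key does.
    private static String key(String kind, String keyName) {
        return kind + "/" + keyName;
    }

    void put(ToyEntity entity) {
        rows.put(key(entity.kind, entity.keyName), entity);
    }

    // Returns null when nothing is stored under that kind and key name.
    ToyEntity get(String kind, String keyName) {
        return rows.get(key(kind, keyName));
    }
}
```

Storing one Person that only has an age and another Person that only has an eyeColor in the same ToyDatastore works without complaint, which is the “typeless” behavior described in the list of features.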

Let’s create two Entities representing Persons. We’ll name them Alice and Bob. Let’s define them now:

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

Entity alice = new Entity("Person", "Alice");
alice.setProperty("gender", "female");
alice.setProperty("age", 20);

Key bobKey = KeyFactory.createKey("Person", "Bob");
Entity bob = new Entity(bobKey);
bob.setProperty("gender", "male");
bob.setProperty("age", 23);

What we’ve demonstrated here are two of the basic ways to create entities. Entity contains five constructors. We’re just demonstrating two of them here.

We’re defining Alice with a raw constructor. We’re passing two Strings: her kind as well as her key name. As we mentioned before – Entities are typeless, and we can specify just about any String as her kind. Effectively, this means that the number of kinds we can have is limited only by the number of kinds that we need, and as long as we don’t lose track of them, we could potentially have hundreds of different kinds without having to create a class for each one. We could even define new kinds at runtime, if we so dared. The key name is what we’ll use to retrieve Alice later on when we need her again. Think of it as a Map or Dictionary key. Once we have an Entity object, we need to define her properties. For now, we’ll define her gender and her age. Note that, again, Properties behave like Maps, and this means that not only can Entities have hundreds of different properties, we could also create new properties at runtime at the expense of compile-time type safety. Choose your poison carefully.

We’re creating Bob’s instance a bit differently, but not too differently. Using KeyFactory’s static createKey method, we create a Key instance. Note the arguments – they are exactly the same: a kind and a key name. In our simple example, this doesn’t give us any additional benefit and costs us an extra line of code, but in more advanced usages, such as creating an Entity with a parent, this technique may result in clearer code. And again – we set Bob’s properties using something similar to a Map.

If you’ve been reading Entity’s Javadoc or following along in your IDE, you’ve probably realized by now that Entity does not contain setKey() or setKind() methods. This is because an Entity’s key is immutable. Once an Entity has a key, it can never be changed. You cannot retrieve an Entity from the datastore and change its key – you must create a new Entity with a new Key and delete the old Entity. This is also true of Entities instantiated in local memory.

Speaking of unsaved Entities, let’s go ahead and save them now. We’ll create an instance of the Datastore client and save Alice and Bob:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

Entity alice = new Entity("Person", "Alice");
alice.setProperty("gender", "female");
alice.setProperty("age", 20);

Key bobKey = KeyFactory.createKey("Person", "Bob");
Entity bob = new Entity(bobKey);
bob.setProperty("gender", "male");
bob.setProperty("age", 23);

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
datastore.put(alice);
datastore.put(bob);

That’s it! DatastoreService’s put() method returns a Key that we can use.

Now let’s demonstrate retrieving Alice and Bob by Key from another class:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

Key bobKey = KeyFactory.createKey("Person", "Bob");
Key aliceKey = KeyFactory.createKey("Person", "Alice");

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Entity alice, bob;

try {
    alice = datastore.get(aliceKey);
    bob = datastore.get(bobKey);

    Long aliceAge = (Long) alice.getProperty("age");
    Long bobAge = (Long) bob.getProperty("age");
    System.out.println("Alice's age: " + aliceAge);
    System.out.println("Bob's age: " + bobAge);
} catch (EntityNotFoundException e) {
    // Alice or Bob doesn't exist!
}

The DatastoreService instance’s get() method takes a Key parameter; this is the same kind of Key we used earlier to construct the Entity representing Bob! This method throws an EntityNotFoundException if no Entity exists for the given Key. We retrieve individual properties using the Entity class’s getProperty() method, which returns an Object – in the case of age, we cast this to a Long, because the datastore stores integer values as longs.
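Because getProperty() hands back a plain Object, a blind cast to Long fails with a ClassCastException as soon as a property holds something else, for example an age that was saved as the String "23". A defensive read might look like the sketch below. PropertyReads and readLong are hypothetical helper names, and a plain Map stands in for the Entity so the snippet runs without the App Engine SDK; with a real Entity you could pass in the Map returned by getProperties().

```java
import java.util.Map;

final class PropertyReads {
    private PropertyReads() {}

    // Reads a numeric property as a Long without a blind cast.
    // 'properties' is a plain Map standing in for an entity's properties.
    // Any Number is accepted; a floating-point value would be truncated.
    static Long readLong(Map<String, Object> properties, String name) {
        Object value = properties.get(name);
        if (value == null) {
            return null; // property missing or explicitly set to null
        }
        if (value instanceof Number) {
            return ((Number) value).longValue();
        }
        throw new IllegalStateException(
                "Property '" + name + "' holds a "
                        + value.getClass().getSimpleName() + ", not a number");
    }
}
```

The design choice here is to fail loudly with a message that names the offending property, rather than letting a ClassCastException surface far from the code that wrote the bad value.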

So there you have it: the basics of working with the low-level API. I’ll likely add more articles in the future about queries, transactions, and more advanced things you can do.

Written by Ikai Lan

June 3, 2010 at 6:46 pm

JRuby In-Memory Search Example With Lucene 3.0.1

with 2 comments

Just for giggles I decided to port the In-Memory search example from my last blog post to JRuby. It’s been some time since I’ve used JRuby for anything, but the team has still been hard at work making strides towards better Java interoperability and ease of use. I downloaded JRuby 1.5.0_RC1, pointed my PATH to the /bin directory, and began hacking.

I’m incredibly impressed with the level of Java interop and startup speed improvements. Kudos to the JRuby team. Integrating Java couldn’t have been easier.

The example is below. Run it with the command:


jruby -r /path/to/lucene-core-3.0.1.jar inmemory.rb

require 'java'
# Either require the JAR file directly (the commented-out line below),
# or pass the -r flag to JRuby as follows:
# jruby -r /path/to/lucene-core-3.0.1.jar inmemory.rb
# require "lucene-core-3.0.1.jar"

java_import org.apache.lucene.analysis.standard.StandardAnalyzer
java_import org.apache.lucene.document.Document
java_import org.apache.lucene.document.Field
java_import org.apache.lucene.index.IndexWriter
java_import org.apache.lucene.queryParser.ParseException
java_import org.apache.lucene.queryParser.QueryParser
java_import org.apache.lucene.store.RAMDirectory
java_import org.apache.lucene.util.Version

java_import org.apache.lucene.search.IndexSearcher
java_import org.apache.lucene.search.TopScoreDocCollector


def create_document(title, content)
  doc = Document.new
  doc.add Field.new("title", title, Field::Store::YES, Field::Index::NO)
  doc.add Field.new("content", content, Field::Store::YES, Field::Index::ANALYZED)  
  doc
end

def create_index
  idx     = RAMDirectory.new
  writer  = IndexWriter.new(idx, StandardAnalyzer.new(Version::LUCENE_30), IndexWriter::MaxFieldLength::LIMITED)

  writer.add_document(create_document("Theodore Roosevelt",
          "It behooves every man to remember that the work of the " +
                  "critic, is of altogether secondary importance, and that, " +
                  "in the end, progress is accomplished by the man who does " +
                  "things."))
  writer.add_document(create_document("Friedrich Hayek",
          "The case for individual freedom rests largely on the " +
                  "recognition of the inevitable and universal ignorance " +
                  "of all of us concerning a great many of the factors on " +
                  "which the achievements of our ends and welfare depend."))
  writer.add_document(create_document("Ayn Rand",
          "There is nothing to take a man's freedom away from " +
                  "him, save other men. To be free, a man must be free " +
                  "of his brothers."))
  writer.add_document(create_document("Mohandas Gandhi",
          "Freedom is not worth having if it does not connote " +
                  "freedom to err."))

  writer.optimize
  writer.close
  idx
end

def search(searcher, query_string)
  parser = QueryParser.new(Version::LUCENE_30, "content", StandardAnalyzer.new(Version::LUCENE_30))
  query = parser.parse(query_string)
  
  hits_per_page = 10
  
  collector = TopScoreDocCollector.create(5 * hits_per_page, false)
  searcher.search(query, collector)
  
  # Notice how this differs from the Java version: JRuby automagically translates
  # underscore_case_methods into CamelCaseMethods, but scoreDocs is not a method:
  # it's a field. That's why we have to use CamelCase here, otherwise JRuby would
  # complain that score_docs is an undefined method.
  hits = collector.top_docs.scoreDocs
  
  hit_count = collector.get_total_hits
    
  if hit_count.zero?
    puts "No matching documents."
  else
    puts "%d total matching documents" % hit_count
    
    puts "Hits for %s were found in quotes by:" % query_string
    
    hits.each_with_index do |score_doc, i|
      doc_id = score_doc.doc
      doc_score = score_doc.score
      
      puts "doc_id: %s \t score: %s" % [doc_id, doc_score]
      
      doc = searcher.doc(doc_id)
      puts "%d. %s" % [i + 1, doc.get("title")]
      puts "Content: %s" % doc.get("content")
      puts
      
    end
    
  end

end

def main
  index = create_index
  searcher = IndexSearcher.new(index)

  search(searcher, "freedom")
  search(searcher, "free")
  search(searcher, "progress or achievements")
  search(searcher, "ikaisays.com")

  searcher.close
end

main()

Written by Ikai Lan

April 25, 2010 at 7:49 pm

Posted in JRuby, Ruby, Software Development

Lucene In-Memory Search Example: Now updated for Lucene 3.0.1

with 3 comments

Update: Here’s a link to some sample code for Python using PyLucene. Thanks, Joseph!

While playing around with Lucene in my experiments to make it work with Google App Engine, I found an excellent example for indexing some text using Lucene in-memory; unfortunately, it dates back to May 2004 (!!!). I’ve updated the example to work with the newest version of Lucene, 3.0.1. It’s below for reference.

The Pastie link for the code snippet can be found here.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import java.io.IOException;

public class LuceneTest{

   public static void main(String[] args) {
      // Construct a RAMDirectory to hold the in-memory representation
      // of the index.
      RAMDirectory idx = new RAMDirectory();

      try {
         // Make a writer to create the index
         IndexWriter writer =
                 new IndexWriter(idx, 
                         new StandardAnalyzer(Version.LUCENE_30), 
                         IndexWriter.MaxFieldLength.LIMITED);

         // Add some Document objects containing quotes
         writer.addDocument(createDocument("Theodore Roosevelt",
                 "It behooves every man to remember that the work of the " +
                         "critic, is of altogether secondary importance, and that, " +
                         "in the end, progress is accomplished by the man who does " +
                         "things."));
         writer.addDocument(createDocument("Friedrich Hayek",
                 "The case for individual freedom rests largely on the " +
                         "recognition of the inevitable and universal ignorance " +
                         "of all of us concerning a great many of the factors on " +
                         "which the achievements of our ends and welfare depend."));
         writer.addDocument(createDocument("Ayn Rand",
                 "There is nothing to take a man's freedom away from " +
                         "him, save other men. To be free, a man must be free " +
                         "of his brothers."));
         writer.addDocument(createDocument("Mohandas Gandhi",
                 "Freedom is not worth having if it does not connote " +
                         "freedom to err."));

         // Optimize and close the writer to finish building the index
         writer.optimize();
         writer.close();

         // Build an IndexSearcher using the in-memory index
         Searcher searcher = new IndexSearcher(idx);

         // Run some queries
         search(searcher, "freedom");
         search(searcher, "free");
         search(searcher, "progress or achievements");

         searcher.close();
      }
      catch (IOException ioe) {
         // In this example we aren't really doing an I/O, so this
         // exception should never actually be thrown.
         ioe.printStackTrace();
      }
      catch (ParseException pe) {
         pe.printStackTrace();
      }
   }

   /**
    * Make a Document object with an un-indexed title field and an
    * indexed content field.
    */
   private static Document createDocument(String title, String content) {
      Document doc = new Document();

      // Add the title as an unindexed field...
      doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));

      // ...and the content as an indexed (analyzed) field. Both values
      // are passed as plain Strings and stored verbatim
      // (Field.Store.YES), so they can be read back from search results.
      doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));

      return doc;
   }

   /**
    * Searches for the given string in the "content" field
    */
   private static void search(Searcher searcher, String queryString)
           throws ParseException, IOException {

      // Build a Query object
      QueryParser parser = new QueryParser(Version.LUCENE_30, 
              "content", 
              new StandardAnalyzer(Version.LUCENE_30));
      Query query = parser.parse(queryString);


      int hitsPerPage = 10;
      // Search for the query
      TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, false);
      searcher.search(query, collector);

      ScoreDoc[] hits = collector.topDocs().scoreDocs;

      int hitCount = collector.getTotalHits();
      System.out.println(hitCount + " total matching documents");

      // Check whether there were any matches

      if (hitCount == 0) {
         System.out.println(
                 "No matches were found for \"" + queryString + "\"");
      } else {
         System.out.println("Hits for \"" +
                 queryString + "\" were found in quotes by:");

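         // Iterate over the returned hits. Use hits.length rather than
         // hitCount: the collector keeps at most 5 * hitsPerPage results,
         // so hitCount can be larger than the array.
         for (int i = 0; i < hits.length; i++) {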
            ScoreDoc scoreDoc = hits[i];
            int docId = scoreDoc.doc;
            float docScore = scoreDoc.score;
            System.out.println("docId: " + docId + "\t" + "docScore: " + docScore);

            Document doc = searcher.doc(docId);

            // Print the value that we stored in the "title" field. Note
            // that this Field was not indexed, but it was stored
            // verbatim (as was the "content" field) and can be
            // retrieved.
            System.out.println("  " + (i + 1) + ". " + doc.get("title"));
            System.out.println("Content: " + doc.get("content"));            
         }
      }
      System.out.println();
   }
}

In progress: still trying to figure out how to get some version of Lucene working on App Engine for Java. My thoughts:

  • Use an In Memory index
  • Serialize to Memcache or the Datastore (not even sure how to do this right now)

Granted, there are limitations to this: if an App Engine application exceeds some memory limit, a SoftMemoryExceeded exception will be thrown. Also, I’m doubtful of the ability to update indexes incrementally in the datastore, not to mention that there’s a 1 MB limit on datastore entries. The Blobstore, accessed programmatically, may not have the latency required. Still – it’s an interesting thought experiment, and there’s probably some compromise we can find with a future feature of App Engine that’ll allow us to make Lucene actually usable. We just have to think of it. Stay tuned. I’ll write another post if I can get even a proof-of-concept to work.
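One way the per-entry size limit might be worked around is to split a serialized index into chunks small enough to store individually. The sketch below shows only the splitting and reassembly of a plain byte array. How the bytes would be obtained from a RAMDirectory and where the chunks would be stored are left open, ByteChunks is a hypothetical name, and the chunk size is a parameter rather than a claim about the exact limit.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ByteChunks {
    private ByteChunks() {}

    // Splits data into consecutive pieces of at most maxChunkSize bytes.
    static List<byte[]> split(byte[] data, int maxChunkSize) {
        if (maxChunkSize <= 0) {
            throw new IllegalArgumentException("maxChunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        int start = 0;
        while (start < data.length) {
            int length = Math.min(maxChunkSize, data.length - start);
            chunks.add(Arrays.copyOfRange(data, start, start + length));
            start += length;
        }
        return chunks;
    }

    // Concatenates the chunks, in list order, back into one byte array.
    static byte[] join(List<byte[]> chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }
}
```

This does nothing for the incremental-update problem: changing a single document would still mean rewriting every chunk.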

Written by Ikai Lan

April 24, 2010 at 8:32 am

Using pattern matching with regular expressions in Scala

with 10 comments

I’ve been trying to use Scala more and more so I can gain some experience and exposure to it. A couple of weeks ago, I wrote a Scala log parser for Ruby on Rails. It is terribly newbie-ish – the classes are mutable and it’s disorganized. It’s a mess. Jorge Ortiz from the Scala mailing list was kind enough to rewrite it in a more Scala style. It completely blew my mind how terse Scala can become when written correctly.

It bothered me, however, dealing with regular expressions the way that I did. The Java interface is pretty clumsy and nowhere near as clean as regular expression pattern extraction in Perl or Ruby.

As it turns out, it’s surprisingly easy to extract text using Regular Expressions in Scala. Throw away Pattern.compile! Check out this hotness below:

First, let’s import Scala’s regex package:

import scala.util.matching.Regex

Now we declare a regular expression to match against. We can do this one of two ways:

val LogEntry = new Regex("""Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""")

I use triple quotes here to signify that I am creating a raw string. A raw string means that I do not need to escape characters like the \ character. If I didn't do this, I'd be forced to use strings like "\\d+". Believe it or not, that extra backslash throws me off. Just goes to show that I have written way too many parsers.

Alternatively, I can declare a new Regex by doing this:

val LogEntry = """Completed in (\d+)ms \(View: (\d+), DB: (\d+)\) \| (\d+) OK \[http://app.domain.com(.*)\?.*""".r

Strings have a method called "r", which will convert it to a Regex object. I'm not sold on this syntax at the moment, since it doesn't play well with eyeball scans, but I'm putting it here for those folks that absolutely need to save characters.

There's nothing really special here yet. The next step is REALLY cool:

val line = "Completed in 100ms (View: 25, DB: 75) | 200 OK [http://app.domain.com?params=here]"

scala> val LogEntry(totalTime, viewTime, dbTime, responseCode, uri) = line
totalTime: String = 100
viewTime: String = 25
dbTime: String = 75
responseCode: String = 200
uri: String =

The local variables totalTime, viewTime, dbTime, responseCode and uri are now bound to the values we want to extract from the original line! The regular expression value defines an unapplySeq method. I’m not quite good enough at Scala to tell you in any definite terms what that means, except that you can use the code in a pattern match:

line match {
case LogEntry(totalTime, viewTime, dbTime, responseCode, uri) => {
/* Process the data */
// do something with totalTime.toInt
// do something with viewTime.toInt
// etc ...
}
case _ => // Do nothing
}

Because you can use a pattern match, and patterns will be matched in the order of definition, this means that you can create several regular expressions representing lines you want to extract, then process them easily using pattern matching.
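For comparison, here is roughly what the same idea costs with plain java.util.regex, the interface this post is happy to leave behind: try a list of patterns in order and convert the numeric groups by hand. This is a sketch, not a port of the log parser. LogLines, Timing and parseTiming are hypothetical names, and the second pattern (a line with only a total time) is an assumed format added purely to show the fall-through.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogLines {
    private LogLines() {}

    // Same expression as LogEntry in the post; backslashes must be doubled in Java.
    private static final Pattern FULL = Pattern.compile(
            "Completed in (\\d+)ms \\(View: (\\d+), DB: (\\d+)\\) \\| (\\d+) OK "
                    + "\\[http://app.domain.com(.*)\\?.*");

    // Assumed fallback format (not from the post): only the total time is present.
    private static final Pattern TOTAL_ONLY = Pattern.compile("Completed in (\\d+)ms.*");

    // Hypothetical value holder; -1 marks a figure the line did not contain.
    static final class Timing {
        final int totalMs;
        final int viewMs;
        final int dbMs;

        Timing(int totalMs, int viewMs, int dbMs) {
            this.totalMs = totalMs;
            this.viewMs = viewMs;
            this.dbMs = dbMs;
        }
    }

    // Patterns are tried in order, so the most specific one goes first.
    static Optional<Timing> parseTiming(String line) {
        Matcher full = FULL.matcher(line);
        if (full.matches()) {
            return Optional.of(new Timing(
                    Integer.parseInt(full.group(1)),
                    Integer.parseInt(full.group(2)),
                    Integer.parseInt(full.group(3))));
        }
        Matcher totalOnly = TOTAL_ONLY.matcher(line);
        if (totalOnly.matches()) {
            return Optional.of(new Timing(Integer.parseInt(totalOnly.group(1)), -1, -1));
        }
        return Optional.empty(); // no pattern matched
    }
}
```

Every pattern needs its own Matcher, its own matches() check and its own group-by-number bookkeeping, which is exactly the ceremony the Scala match expression removes.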

Pretty powerful stuff. What would really make my day would be if someone knew how I could extract the values totalTime, viewTime, and dbTime as integers and not have to do a conversion – I’m already matching with \d+. Ideas?

Written by Ikai Lan

April 4, 2009 at 5:41 pm

Posted in Scala

Twitter, Ruby on Rails, Scala and people who don’t RTFA

with 33 comments

The Register recently published an article titled “Twitter jilts Ruby for Scala”, prompting a wave of Tweets (Twitter messages for the jargon challenged) about Ruby on Rails and scaling. More specifically, its lack of ability to do so. The source of this article was a talk given by the API lead at Twitter, Alex Payne, at the Web 2.0 Expo in San Francisco, where he discussed porting parts of Twitter from Ruby to Scala. Naturally, the armchair commentators added their two cents about the message, taking it to mean that Scala is *the* one true language for internet companies, and that Ruby on Rails should not be used because it cannot scale.

I tweeted about my annoyance with retweeters that don’t understand what they are retweeting, and Alex responded: “It’s really frustrating for me. I don’t think the people who were there for my talk got that impression. The press skewed it.”

Alex wasn’t saying that Rails can’t scale. Alex was saying that he really likes Scala, and parts of Twitter that were written in Ruby were rewritten in Scala because some tasks are simply not appropriate for Ruby. I know of Kestrel, their messaging queue, but likely there are other middleware layers that sit between their APIs/web interfaces and their data sources that are being ported.

Now, Scala is an amazing language. It compiles to JVM bytecode, and thus, most procedures will run near or at the speed of Java. Functions can be treated as first class objects and passed around like variables. Its system of type inference means less ceremony for simply declaring variables. Traits function remarkably like mixins in dynamic languages, allowing very rich composite classes that, in many cases, need to add very little boilerplate for the composition, unlike Java interfaces. And pattern matching with wildcards? I can’t even think of an analogy for how powerful a tool this is. Is it the best compiled language? Some think so, and I can certainly see where they are coming from. I’m learning Scala as we speak. The basics were pretty easy to pick up, but I’m still struggling to be good with it. Scala best practices such as programming in an immutable style are far easier to preach than to actually do.

But Scala is not Ruby, and will never be Ruby. Ruby is probably the most powerful programming language on the planet for creating DSLs, or domain specific languages. It can do this because any class or object can be extended, any method overridden, any constant undefined and redefined, and so on, and so forth. You can take an Integer and declare more methods on it. Duck typing means that you don’t have to set up boilerplate interfaces and abstract classes to create objects that can replace other objects with similar interfaces. A web development platform as powerful and syntactically concise as Rails could only have been done in Ruby. Imagine this: a developer looks at a model file representing a database table, and immediately understands the business rules regarding validity as well as the callbacks that will take place when the object is created, updated or destroyed. Or this: it is possible to create a plugin that extends collection objects to allow very complex pagination, and a client developer only has to include a file or remove the include statement. This plugin recognizes if the collection in question maps to a database, and if so, applies LIMIT and OFFSET statements to SQL queries so as not to eagerly load too many objects at once. This plugin exists.

This power does not come without a price. Ruby is slow. Ruby is generally recognized as being twice as slow as Python, and at least an order of magnitude slower than C++ or Java. Charles Nutter, one of the creators of the JRuby VM for the Ruby language, recently posted an article about optimizations that could be made to the VM, and he improves Ruby performance by as much as twenty times by removing much of Ruby’s power and breaking compatibility. If anything, it shows us skeptics that there is no such thing as a free lunch.

So if Ruby is slow, and Rails uses Ruby, does that mean Rails cannot scale? Let me make this clear right now: Though Ruby is slow, this does not mean that Rails does not scale. Scalability is a very difficult concept, often oversimplified to be synonymous with efficiency. Efficiency is a facet of scalability, yes, but it is not its only dimension. Reliability and uptime are very important dimensions of a scalable system. So is horizontal scalability: the ability to serve an increase in usage with a corresponding increase in computational nodes. Though it may seem like a given that throwing hardware at a problem will increase capacity, this simply isn’t so. Anybody that has worked on a large data driven application will tell you that a centralized, authoritative source such as a database cluster is a bottleneck, and simply throwing hardware at the problem results in diminishing marginal returns. Then there is quality of service. A truly scalable architecture will have minimal service degradation as users and usage exponentially increase.

I use the word “architecture” because a truly scalable system that meets the requirements set by its intended consumer is rarely, if ever, about a single component. Even if we were just talking about speed and none of the other dimensions of scalability, there’s a great article by Joe Stump where he calls out language critics about this exact subject. C++ can probably assemble HTML at least a hundred times faster than Ruby, but this has minimal impact on what a user sees because disk reads are slow, talking to a data source is slow, and sending data over the internet is *really* slow. On top of all that, Steve Souders, creator of YSlow, a popular tool for benchmarking perceived speed of web sites, argues that for most sites, 90% of the wait time a user has to put up with is a result of loading assets, JavaScript and stylesheets. Rails and proper database constraints can ensure data integrity (reliability): check. Rails is by default stateless and sessions can be centralized (fault tolerance and horizontal scalability): check. Rails can scale, and does scale. Twitter is not replacing their web tier because it is fast enough; instead, they have focused on optimizing their middle tier, which sits between their web layer and their data layer. There’s no point in replacing the web tier, since its performance and reliability are completely bound by the data and middleware layers.

This is where Scala enters the picture: middleware. Likely this refers to a non-RDBMS datasource serving a denormalized social graph, a dispatching component for pushing messages through SMS and email, a queue and queue workers. There are a few requirements here: concurrency and computational speed, both of which are areas where Ruby falls flat because MRI Ruby, the stock VM, uses green threads which block on I/O and do not make use of multicore processors. In addition, Ruby’s memory requirements are aggressive. Scala, on the other hand, is fast and is as concurrent as the JVM will allow, which is pretty damned concurrent.

So what Twitter did was optimize the bottleneck and leave their investment in the front-end intact. This shouldn’t surprise anybody. Facebook started out on PHP, but now their backend is an amalgam of C/C++, Erlang and other languages. Google runs Java and C++ for their backend, but I’m told several web tier services are written in Python. And so forth. This doesn’t mean don’t use Python, and don’t use PHP. It just means to be ready to optimize and possibly to replace those components when the time is right, which, for many startups, is a LOOOOOOONG way off.

In fact, I still push Ruby on Rails as the development platform of choice for web startups launching a 1.0. Why? There are many reasons. Here are a few:

  • You will hit the ground running fast, and you will have something working within a single development cycle. If your application has any kind of boilerplate, like user management, pagination, multiformat output, simple AJAX or CRUD functionality, Rails just saved you from having to write most of it.
  • Onboarding developers is fast. I’ve been on projects where new developers have been able to be productive the first or second day of looking at a project’s structure and tests.
  • Rails emphasizes the importance of test-driven development, and any good Rails developer will feel wrong writing code without corresponding tests – when I started Scala, I settled on the NetBeans IDE because it seemed like the easiest to get going with JUnit.
  • Solving scalability problems too early is a bad idea. Nobody starts out with a sharded database. Instead of building features or attracting customers to a usable product, you spend your time building an incredibly complex system that will scale but at severe opportunity cost if it hits the market late. You want scalability problems, because it means you are growing too fast. In the words of The Wire’s Marlo Stanfield, “that sound like one of them good problems.”

- Ikai

Written by Ikai Lan

April 2, 2009 at 9:58 pm

Posted in Ruby on Rails, Scalability


First impressions of Lift Scala web development framework from a Ruby on Rails developer

with 8 comments

Over the past few months I’ve been hearing a lot about Scala and have, in general, been very interested. Scala, short for “scalable language” (which is why I will continue to pronounce it “skay-lah” rather than “skah-lah”), is a strongly typed JVM language that combines aspects of functional programming and dynamic languages with static typing, giving it some of the flexibility of languages such as Ruby or Haskell along with the performance and interoperability of Java.

One of the projects leading the charge in Scala is Lift, a web development framework. I’ve been developing in Ruby on Rails for the past few years, and really, I don’t need to learn another framework. I took a very close look at Django and even built a few projects, but stuck with Rails as my primary tool of choice. Lift interests me for the following reasons:

  • can run in any Java Application Server
  • can run inline Java

Why are these important? I’ve met a lot of consultants who recommended a solution that required deploying mod_php/Erlang/WSGI/Mongrel and got the project shot down. But switch the pitch to Java running in an application server? Happy client. I’ve been pushing JRuby hard on every other Rails developer I meet, so these requirements are not high on my list, though they add a lot of points. Also, it is worth mentioning that many, many Java-based frameworks, such as Grails or the recently open sourced AribaWeb, can do these too.

  • high performance* (have not seen with my own eyes)
  • uses Scala
  • out of the box Comet support

Comet is a way of simulating server push over HTTP in a browser, which is otherwise purely a client-pull environment. The most common Comet technique is long polling, and it is how most JavaScript browser chat applications work (Meebo, Facebook chat, etc). Basically, the client JavaScript opens an XMLHttpRequest (XHR for short) to the server, which the server does not respond to until there is data to be pushed (hence the name “long polling”). This gets rid of the need for clients to poll the server at intervals, which has two problems: at longer intervals, data is not pushed quickly enough, which would make an IM application unusable; at shorter intervals, clients would quickly saturate their network connections as well as the server. Comet isn’t without its problems. Most Java application servers, for instance, use a “one thread per request” model, which does not scale efficiently here: the threads become the limiting factor in the number of clients that can connect at once, even though the threads are idle most of the time. In fact, open connections were such a problem that Facebook’s chat implementation is written in Erlang, a concurrent language that can create millions of lightweight processes and was really the only implementation that could scale efficiently to their needs. They blog about it here: http://www.facebook.com/note.php?note_id=14218138919
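For readers who have not seen long polling before, here is a rough Scala sketch of the server side of the idea: a request handler asks a per-client mailbox for data and simply blocks until something arrives or a timeout passes. The names LongPollSketch, Mailbox, push and longPoll are my own inventions for illustration and are not part of Lift, Jetty or any Comet library. The only real API used is the JDK’s LinkedBlockingQueue.

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

object LongPollSketch {
  // Hypothetical per-client mailbox. The server parks outgoing
  // messages here until the client's open request picks them up.
  final class Mailbox {
    private val queue = new LinkedBlockingQueue[String]()

    // Called by whatever produces events, e.g. another user
    // sending a chat line.
    def push(message: String): Unit = queue.put(message)

    // Called by the request handler. Blocks the calling thread until
    // a message arrives or the timeout elapses, then returns
    // Some(message) or None. On None the client would re-open the
    // request and wait again.
    def longPoll(timeoutMillis: Long): Option[String] =
      Option(queue.poll(timeoutMillis, TimeUnit.MILLISECONDS))
  }
}
```

Note that longPoll ties up one thread per waiting client for the whole wait. That is exactly the “one thread per request” cost described above, and it is why this naive version would not scale to many idle connections.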

The way Lift deals with Comet scaling issues is by making use of Jetty Continuations. A thread still handles each request while there is work to do, but a request that is only waiting for data is suspended and its thread is handed back to the pool, resulting in a much more efficient use of resources.

These are the reasons that make Lift appealing to me to learn. However, after fiddling with Lift, there are a few things off the top of my head that I already don’t like much about it, or that are so different they threaten to make my brain reboot:

  • Servers are not stateless. If you need to horizontally scale, your load balancer needs to read the JSESSIONID parameter in the HTTP request and direct traffic based on that information. I’ve been told that Lift is so incredibly high performance this isn’t necessary. This doesn’t answer the question of hot failover, and frankly, I was a bit disappointed.
  • Everything depends on state! This is probably WHY Lift can be so high performance. Most web frameworks deal with requests as they come in, looking up the same data per request to reinstantiate session objects, User objects, or any other objects that need to persist longer than a web request. Lift’s approach is such a different way of thinking about problems and web development that experienced web developers coming into Lift from other frameworks will have to start with a blank slate. Lift encourages abstracting away the request/response cycle. It remains to be seen whether this is a good thing or a bad thing.
  • Unintuitive way to add new pages. You have to add to a sitemap in what might be the most unintuitive manner possible:
    val entries = Menu(Loc("Home", List("index"), "Home")) :: Menu(Loc("Test", List("test"), "Test")) :: User.sitemap

    The :: operator prepends an element to a list (list concatenation is :::), so this line prepends two Menu entries to User.sitemap. This will make all your pages appear in the site menu. To make pages that don’t appear in the site menu?

    Menu(Loc("MyHiddenPage", List("hidden"), "hidden", Hidden))

    To me this seems unintuitive, but then again, I don’t actually understand what is happening here, as the API docs are unusable. I haven’t figured out routing yet, for instance. When I was learning Django, I figured out how to set up routes to functions in minutes.

  • Scala shorthand. I’ve mentioned this and all the functional programmers have screamed bloody murder. Code is for human beings, not cyborgs. I understand that it’s clever to save keystrokes:
    list.sort(_ < _)

    As opposed to, say, Ruby:

    list.sort { |a, b| a <=> b }

    But these are trivial examples. There’s code all over the place that looks like this:

    fun1(a, b _).call(_).fun2(_ <= _)

    I’m sorry, but that’s not very welcoming. The worst part is that Scala HAS verbose syntax. You can use underscore notation, or you can use:

    () => something
    (x) => x.something

    I was at a job interview once where I was asked to write code to solve a problem using any language I wanted. I wrote a monstrous Ruby one-liner that was fun to write but that I would never, ever write in real life if I expected other developers to read my code. Sometimes there really is such a thing as too clever.

  • Dearth of working tutorials or documentation* (I plan on writing tutorials for as long as I am learning or interested in Lift)
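Two of the Scala points in the list above are easier to judge with something you can run. The sketch below shows how :: (prepend one element) differs from ::: (join two lists), and puts the underscore shorthand next to the same code written with named parameters. ScalaShorthandSketch and its values are hypothetical names of my own. One assumption to flag: I use sortWith, which is the method name in current Scala releases, in place of the sort call shown in the list.

```scala
object ScalaShorthandSketch {
  val tail: List[String] = List("test", "user")

  // `::` prepends ONE element to an existing list.
  val consed: List[String] = "index" :: tail
  // → List("index", "test", "user")

  // `:::` concatenates two lists.
  val joined: List[String] = List("index", "home") ::: tail
  // → List("index", "home", "test", "user")

  val numbers: List[Int] = List(3, 1, 2)

  // Placeholder syntax: each underscore stands for the next parameter.
  val shortSorted: List[Int] = numbers.sortWith(_ < _)

  // The same comparison with named parameters.
  val longSorted: List[Int] = numbers.sortWith((a, b) => a < b)

  // One-parameter version of the same trade-off.
  val shortDoubled: List[Int] = numbers.map(_ * 2)
  val longDoubled: List[Int] = numbers.map(x => x * 2)
}
```

Both spellings compile to the same thing, so choosing between them is purely a readability call.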

In spite of these things I still think Lift has an interesting approach to many of the problems of web development. Rather than judge Lift based on my initial impressions, I’ll get into it some more before I decide if it’s a technology that I’d push. Ruby on Rails was one such technology, and in spite of all of its problems I still view it as an amazing development platform. The difference here is that by the time I started learning Rails, several books had already been written, and there were plenty of tutorials on the web.

One of the problems with Lift is that most of the tutorials I have seen so far are written by actual developers of Lift. As a developer of a framework, it’s hard to figure out what people need to know. There’s a tendency to say to new developers coming on, “You only need to know A, B and C.” Then, thirty minutes later, when the new guy is completely lost, “Oh, and D! Sorry!” As a newbie, trust me, I won’t miss D. It’ll be in my face such that I’ll get mad, stop coding, then complain on Twitter, on the mailing list and in this blog.

As for the immediate future, I’ll continue writing tutorials for Lift, beginning with a tutorial coming soon about how to get started developing on Lift with NetBeans. Stay tuned.

Written by Ikai Lan

March 3, 2009 at 8:39 pm

Posted in Lift, Scala
