Clearing up some things about LinkedIn mobile’s move from Rails to node.js
There’s an article on highscalability that’s talking about the move from Rails to node.js (for completeness: its sister discussion on Hacker News). It’s not the first time this information has been posted. I’ve kind of ignored it until now (because I didn’t want to be that guy), but it’s come up enough times and no one has spoken up, so I suppose it’s up to me to clear a few things up.
I was on the team at LinkedIn that was responsible for the mobile server, and while I wasn’t the primary contributor to that stack, I built and contributed several things, such as the unfortunate LinkedIn WebOS app which made use of the mobile server (and a few features) and much of the initial research behind productionizing JRuby for web applications (I did much more stuff that wasn’t published). I left LinkedIn in 2009, so I apologize if any new information has surfaced. My hunch is that even if I’m off, I’m not off by that much.
Basically: the article is leaving out several facts. We can all learn something from the mobile server and software engineering if we know the full story behind the whole thing.
In 2008, I joined a software engineering team at LinkedIn that was focused on building things outside the standard Java stack. You see, back then, to develop code for linkedin.com, you needed a Mac Pro with 6 gigs of RAM just to run your code. And those requirements kept growing. If my calculations are correct, the standard setup for engineers now is a machine with 20 or more gigabytes of RAM just to RUN the software. In addition, each team could only release once every 6 weeks (this has been fixed in the last few years). It was deemed that we needed to build out a platform off the then-fledgling API and start creating teams to get off the 6 week release cycle so we could iterate quickly on new features. The team I was on, LED, was created for this purpose.
Our first project was a rotating globe that showed off new members joining LinkedIn. It used to run on Poly9, but when they got shut down, it looks like someone migrated it to use Google Earth. The second major project was m.linkedin.com, a mobile web client for LinkedIn that would be one of the major clients of our fledgling API server, codenamed PAL. Given that we were building out an API for third parties, we figured that we could eat our own dogfood and build out LinkedIn for mobile phones with browsers. This is 2008, mind you. The iPhone had just come out, and it was a very Blackberry world.
The stack we chose was Ruby on Rails 1.2, and the deployment technology was Mongrel. Remember, this is 2008. Mongrel was cutting edge Ruby technology. Phusion Passenger wasn’t released yet (more on this later), and Mongrel was light-years ahead of FastCGI. The problem with Mongrel? It’s single-threaded. It was deemed that the cost of shipping fast was more important than CPU efficiency, a choice I agreed with. We were one of the first products at LinkedIn to do i18n (well, we only did translations) via gettext. We deployed using Capistrano, and were the first ones to use nginx. We did a lot of other cool stuff, like experiment with Redis, learn a lot about memcached in production (nowadays this is a given, but there was a lot of memcached vs EHCache talk back then). Etc, etc. But I’m not trying to talk about how much fun I had on that team. Well, not primarily.
I’m here to clear up facts about the post about moving to node.js. And to do that, I’m going to go back to my story.
The iPhone SDK had shipped around that time. We didn’t have an app ready for launch, but we wanted to build one, so our team did, and we inadvertently became the mobile team. So suddenly, we decided that this array of Rails servers that made API calls to PAL (which was, back then, using a pre-OAuth token exchange technology that was strikingly similar) would also be the primary API server for the iPhone client and any other rich mobile client we’d end up building, this thing that was basically using Rails rhtml templates. We upgraded to Rails 2.x+ so we could have the respond_to directive for different outputs. Why didn’t we connect the iPhone client directly to PAL? I don’t remember. Oh, and we also decided to use OAuth for authenticating the iPhone client. Three-legged OAuth, so we also turned those Rails servers into OAuth providers. Why did we use 3-legged OAuth? Simple: we had no idea what we were doing. I’LL ADMIT IT.
Did I mention that we hosted outside the main data centers? This is what Joyent talks about when they say they supplied LinkedIn with hosting. They never hosted linkedin.com proper on Joyent, but we had a long provisioning process for getting servers in the primary data center, and there were these insane rules about no scripting languages in production, so we decided it was easier to adopt an outside provider when we needed more capacity.
Here’s what you were accessing if you were using the LinkedIn iPhone client:
iPhone -> m.linkedin.com (running on Rails) -> LinkedIn’s API (which, for all intents and purposes, only had one client, us)
That’s a cross data center request, guys. Running on single-threaded Rails servers (every request blocked the entire process), running Mongrel, leaking memory like a sieve (this was mostly the fault of gettext). The Rails server did some stuff, like translations, and transformation of XML to JSON, and we tested out some new mobile-only features on it, but beyond that it didn’t do a lot. It was little more than a proxy. A proxy with a maximum concurrency factor dependent on how many single-threaded Mongrel servers we were running. The Mongrel(s), as we affectionately referred to them, often bloated up to 300 MB of RAM each, so we couldn’t run many of them.
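To put rough numbers on that ceiling, here is a back-of-the-envelope sketch in Python. Only the 300 MB per Mongrel figure comes from the story above; the 8 GB of RAM per host and the 250 ms round trip to the API are made-up values for illustration, not measurements from the actual deployment.

```python
def max_workers(host_ram_mb: int, worker_ram_mb: int) -> int:
    """How many single-threaded workers fit in memory on one host."""
    return host_ram_mb // worker_ram_mb


def max_requests_per_second(workers: int, upstream_latency_s: float) -> float:
    """Each worker serves one request at a time and blocks on the upstream
    call, so throughput is capped at workers / latency."""
    return workers / upstream_latency_s


# Assumed values, for illustration only: 8 GB host, 250 ms upstream round trip.
# The ~300 MB per worker is the figure mentioned in the post.
workers = max_workers(host_ram_mb=8192, worker_ram_mb=300)
ceiling = max_requests_per_second(workers, upstream_latency_s=0.25)
print(f"{workers} workers -> at most {ceiling:.0f} requests/second per host")
```

Under those assumptions the cap is set by memory and upstream latency, not by how fast the Ruby code itself runs.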
At this time, I was busy productionizing JRuby. JRuby, you see, was taking full advantage of Rails’ ability to serve concurrent requests using JVM concurrency. In addition, JRuby outperformed MRI in almost every real benchmark I threw at it; there were maybe 1 or 2 specific benchmarks where it didn’t. I knew that if we ported the mobile server to JRuby, we could have gotten more performance and gotten way more concurrency. We would have kept the same ability to deploy fast with the option to in-line into many of the Java libraries LinkedIn was using.
But we didn’t. Instead, the engineering manager at the time ruled in favor of Phusion Passenger, which, to be fair, was an easier port than JRuby. We had come to depend on various native extensions, gettext being the key one, and we didn’t have time to port the translations to something that was JRuby friendly. I was furious, of course, because I had been evangelizing JRuby as the best Ruby production environment and no one was listening, but that’s a different story for a different time. Well, maybe some people listened; those Square guys come to mind.
This was about the time I left LinkedIn. As far as I know, they didn’t build a ton more features. Someone told me that one of my old teammates suddenly became fascinated with node.js, and pretty much singlehandedly decided to rewrite the mobile server using node. Node was definitely a better fit for what we were doing, since we were constantly blocking on a cross data center call, and a non-blocking server for I/O has been shown to be highly advantageous from a performance perspective. Not to mention: we never intended for the original Ruby on Rails server to be used as a proxy for several years.
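The blocking versus non-blocking difference can be illustrated with a small simulation. This is a sketch in Python with asyncio, not node.js and not anything resembling the actual mobile server: the upstream call is faked with a sleep, and the 50 ms latency and 10 requests are assumed values picked to keep the run short.

```python
import asyncio
import time

UPSTREAM_LATENCY_S = 0.05  # assumed stand-in for the cross data center round trip
N_REQUESTS = 10            # assumed number of requests arriving at once


def blocking_proxy(n_requests: int) -> float:
    """One single-threaded worker: each request blocks until upstream returns."""
    start = time.perf_counter()
    for _ in range(n_requests):
        time.sleep(UPSTREAM_LATENCY_S)  # the worker can do nothing else meanwhile
    return time.perf_counter() - start


async def _handle_request() -> None:
    # Waiting on the fake upstream call yields control to the event loop,
    # so other requests can start their own upstream calls in the meantime.
    await asyncio.sleep(UPSTREAM_LATENCY_S)


async def _serve_all(n_requests: int) -> None:
    await asyncio.gather(*(_handle_request() for _ in range(n_requests)))


def non_blocking_proxy(n_requests: int) -> float:
    """One single-threaded event loop: all upstream waits overlap."""
    start = time.perf_counter()
    asyncio.run(_serve_all(n_requests))
    return time.perf_counter() - start


blocking_s = blocking_proxy(N_REQUESTS)
non_blocking_s = non_blocking_proxy(N_REQUESTS)
print(f"blocking:     {blocking_s:.2f}s for {N_REQUESTS} requests")
print(f"non-blocking: {non_blocking_s:.2f}s for {N_REQUESTS} requests")
```

Both versions use a single thread. The only difference is whether the process sits idle while waiting on the upstream call, which is the situation the old Rails proxy was in.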
So, knowing all the facts, what are all the takeaways?
- Is v8 faster than MRI? MRI is generally slower than YARV (Ruby 1.9), and, at least in these benchmarks, I don’t think there is any question that v8 is freakin’ fast. If node.js blocked on I/O, however, this fact would have been almost completely irrelevant.
- The rewrite factor. How many of us have been on a software engineering project where the end result looked nothing like what we planned to build in the first place? And, knowing fully the requirements, we know that, if given time and the opportunity to rebuild it from scratch, it would have been way better? Not to mention: I grew a lot at LinkedIn as a software engineer, so the same me several years later would have done a far better job than the same me in 2008. Experience does matter.
- Firefighting? That was probably a combination of several things: the fact that we were running MRI and leaked memory, or the fact that the ops team was 30% of a single guy.
What I’m saying here is use your brain. Don’t read the High Scalability post and assume that you must build your next technology using node.js. It was definitely a better fit than Ruby on Rails for what the mobile server ended up doing, but it is not a performance panacea. You’re comparing a lower level server to a full stack web framework.
That’s all for tonight, folks, and thank you internet for finally goading me out of hiding again.