Ikai Lan says

I say things!

Archive for October 2012

Why “The Real Reason Silicon Valley Coders Write Bad Software” is wrong

with 12 comments

There was an article in The Atlantic this morning titled, “The Real Reason Silicon Valley Coders Write Bad Software” with the tagline, “If someone had taught all those engineers how to string together a proper sentence, Windows Vista would be a lot less buggy.” The author, Bernard Meisler, seems to think that the cause of “bad software” is, and I quote:

“But the downfall of many programmers, not just self-taught ones, is their lack of ability to sustain complex thought and their inability to communicate such thoughts. That leads to suboptimal code, foisting upon the world mediocre (at best) software like Windows Vista, Adobe Flash, or Microsoft Word.”

Not only are the conclusions of the article inaccurate, it paints a negative portrayal of software engineers that isn’t grounded in reality. For starters, there is a distinction between “bad” software and “buggy” software. Software that is bad tends to be a result of poor usability design. Buggy software, on the other hand, is a consequence of a variety of factors stemming from the complexity of modern software. The largest factor in reducing the number of bugs isn’t going to come from improving the skills of individual programmers, but rather, from instituting quality control processes throughout the software engineering lifecycle.

Bad Software

Bad software is software that, for whatever reason, does not meet the expectation of users. Software is supposed to make our lives simpler by crunching data faster, or automating repetitive tasks. Great software is beautiful, simple, and, given some input from a user, produces correct output using a reasonable amount of resources. When we say software is bad, we mean any combination of things: it’s not easy to use. It gives us the wrong output. It uses resources poorly. It doesn’t always run. It doesn’t do the right thing. Bugs contribute to a poor user experience, but are not the sole culprit for the negative experiences that users have with software. Let’s take one of the examples Meisler has cited: Windows Vista. A quick search for “why does windows vista suck” in Google turns up these pages:

http://www.thebuzzmedia.com/why-vista-sucks/
http://www.intelliadmin.com/index.php/2007/01/the-5-sins-of-vista/
http://techgage.com/article/top_8_vista_annoyances

Oh, bugs are on there, and they’re pretty serious. We’ll get to that. But what else makes Windows Vista suck, according to those sites? Overly aggressive security prompts. Overly complex menus (we’ll visit the idea of complexity again later, I promise). None of the menus make sense. Changed network interface. Widgets that are too small to be usable shipping with the system. Rearranged menus. Search that only works on 3 folders. Those last few things aren’t bugs, they’re usability problems. The software underneath is, for the most part, what we software engineers call working as intended. Some specification somewhere designed those features to work that way, and the job of the software engineers is, in many companies, to build the software to that specification, as ridiculous as the specification is. One of my coworkers points out that Alan Cooper, creator of Visual Basic, wrote a great book about this subject titled, “The Inmates are Running The Asylum”. Interestingly enough, his argument is that overly technical people are harmful when they design user interactions, and this results in panels like the Windows Vista search box with fifty different options. But, to be fair, even when user interactions are designed well, just making the software do what the user expects is hard. A simple button might be hiding lots of complex interactions underneath the hood to make the software easy to use. The hilarious and often insightful Steve Yegge, talks about just that in a tangentially related post about complexity in software requirements. Software is generally called “bad” when it does not do what the user expects, and this is something that is really hard to get right.

Buggy Software

Buggy software, on the other hand, is software that does not behave as the software engineer expects or the specification dictates. This is in stark contrast to bad software, which is software that does not behave as the way a user expects. There’s often overlap. A trivial example: let’s suppose an engineer writes a tip calculator for mobile phones that allows a user to enter a dollar amount, and press a “calculate” button, which then causes the application to output 15% of the original bill amount on the screen. Let’s say a user uses the application, enters $100, and presses calculate. The amount that comes out is $1500. That’s not 15%! Or – the user presses calculate, and the application crashes. The user expects $15, but gets $1500. Or no result, because the application ends before it presents output.

Software is buggy partially because of bad documentation, as Meisler asserts, but not primarily because of it. Software isn’t even buggy because programmers can’t express “complex thoughts”, another of Meisler’s gems; all of programming is the ability to “combine simple ideas into compound ideas”. Software is buggy because of problems stemming out of complexity.

All software is built on top of abstractions. That is, someone else is responsible for abstracting away the details such that a programmer does not need to fully understand another system to be able to use it. As a programmer, I do not need to understand how my operating system communicates with the hard drive to save a file, or how my hard disk manages its read/write heads. I don’t need to write code that says, “move the write head” – I write code that says, “write this data to a file to this directory”.  Or maybe I just say, “just save this data” and never worry about files or directories. Abstractions in software are kind of like the organizational structure of a large company. The CEO of a large car manufacturer talks to the executive board about the general direction of the company. The executive staff then break these tasks down into more specific focus area goals for their directors, who then break these tasks into divisional goals for the managers, who then break these tasks into team goals and tasks for the individual employees that actually build, design, test, market, and sell the damn things. To make this more convoluted, it’s not all from top to bottom communication, either. There are plenty of cross team interactions, and interactions between layers of management that cross the chain of command.

To say that poor documentation is the primary source of bugs is laughable. Documentation exists to try to make sense of the complexity, but there is no way documentation can be comprehensible in any reasonably complex software with layers of abstraction, because, as Joel Spolsky, founder of Fog Creek Software says, abstractions leak. Programmers cannot know all the different ways abstractions they are depending on will fail, and thus, they cannot possibly report or handle all the different ways the abstraction they are working on will fail. More importantly: programmers cannot know how every possible combination of abstractions they are depending on will produce subtly incorrect results that result in more and more warped results up the abstraction stack. It’s like the butterfly effect.  By the time a bug surfaces, a programmer needs to chase it all the way down the rabbit hole, often into code he does not understand.  Documentation helps, but no programmer reads documentation all the way down to the bottom of the stack before he writes code. It’s not commercially feasible for programmers to do this and retain all the information a priori. Non-trivial software is complex as hell underneath the hood, and it doesn’t help that even seemingly simple software often has to turn water into wine just to try to do what a user expects.

Software engineers and critical thinking

I don’t deny the importance of writing or critical thinking skills. They are crucial. I wouldn’t be surprised if the same ability to reason through complex thoughts allows people to write well as well as program well. But to assert that writing skills lead to reasoning skills? This is a case of placing the cart before the horse. Meisler is dismissive of the intellectual dexterity needed to write programs:

“Most programmers are self-taught and meet the minimum requirement for writing code — the ability to count to eight”

It’s not true. Programming often involves visualizing very abstract data structures, multivariate inputs/outputs, dealing with non-deterministic behavior, and simulating the concurrent interactions between several moving parts in your mind. When I am programming, holding in my mental buffer the state of several objects and context switching several times a second to try to understand how a small change I make in one place will ripple outwards. I do this hundreds of times a session. It’s a near trance-like state that takes me some time to get into before I am working at full speed, and why programming is so damned hard. It’s why I can’t be interrupted and need a contiguous block of time to be fully effective on what Paul Graham calls the maker’s schedule. I’m not the only one who feels this way – many other programmers report experiencing the mental state that psychologists refer to as “flow” when they are performing at their best. 

How to reduce the incidences of bugs

Phillipe Beaudoin, a coworker, writes:

I like to express the inherent complexity of deep software stacks with an analogy, saying that software today is more like biology than mathematics. Debugging a piece of software is more like an episode of House than a clip from A Beautiful Mind. Building great software is about having both good bug prevention processes (code reviews, tests, documentation, etc.) as well as good bug correction processes (monitoring practices, debugging tools).

Trying to find a single underlying cause to buggy software is as preposterous as saying there is a single medical practice that would solve all of earth’s health problems.

Well said.

I’m disappointed in the linkbait title, oversimplification, and broad sweeping generalizations of Bernard Meisler’s article. I’m disappointed that this is how software engineering is being represented to a mainstream, non-techie audience. It’s ironic that the article totes writing skills, but is poorly structured in arguing a point. It seems to conclude that writing skills are the reason code is buggy. No wait – critical thinking. Ah! Nope, surprise, writing skills, and a Steve Jobs quote that is used in a misleading way and taken out of context mixed in for good measure. He argues for the logic of language, but as many of us who also write for fun and profit know, human language is fraught with ambiguity and there’s a lot less similarity between prose and computer programming languages than Meisler would have the mainstream audience believe. I’m sorry, Herr Meisler, but if your article were a computer program, it simply wouldn’t compile.

— Ikai

Written with special thanks to Philippe Beaudoin, Marvin Gouw, Alejandro Crosa, and Tansy Woan.

Written by Ikai Lan

October 9, 2012 at 11:14 pm

Clearing up some things about LinkedIn mobile’s move from Rails to node.js

with 13 comments

There’s an article on highscalability that’s talking about the move from Rails to node.js (for completeness: its sister discussion on Hacker News). It’s not the first time this information has been posted. I’ve kind of ignored it for now (because I didn’t want to be this guy), but it’s come up enough times and no one has spoken up, so I suppose it’s up to me to clear a few things up.

I was on the team at LinkedIn that was responsible for the mobile server, and while I wasn’t the primary contributor to that stack, I built and contributed several things, such as the unfortunate LinkedIn WebOS app which made use of the mobile server (and a few features) and much of the initial research behind productionizing JRuby for web applications (I did much more stuff that wasn’t published). I left LinkedIn in 2009, so I apologize if any new information has surfaced. My hunch is that even if I’m off, I’m not off by that much.

Basically: the article is leaving out several facts. We can all learn something from the mobile server and software engineering if we know the full story behind the whole thing.

In 2008, I joined a software engineering team that LinkedIn that was focused on building things outside the standard Java stack. You see, back then, to develop code for linkedin.com, you needed a Mac Pro with 6gigs of RAM just to run your code. And those requirements kept growing. If my calculations are correct, the standard setup for engineers now is a machine with 20 or more gigabytes of RAM just to RUN the software. In addition, each team could only release once every 6 weeks (this has been fixed in the last few years). It was deemed that we needed to build out a platform off the then-fledgling API and start creating teams to get get off the 6 week release cycle so we could iterate quickly on new features. The team I was on, LED, was created for this purpose.

Our first projects was a rotating globe that showed off new members joining LinkedIn. It used to run Poly9, but when they got shut down, it looks like someone migrated it to use Google Earth. The second major project was m.linkedin.com, a mobile web client for LinkedIn that would be one of the major clients of our fledgling API server, codenamed PAL. Given that we were building out an API for third parties, we figured that we could eat our own dogfood and build out LinkedIn for mobile phones with browsers. This is 2008, mind you. The iPhone just came out, and it was a very Blackberry world.

The stack we chose was Ruby on Rails 1.2, and the deployment technology was Mongrel. Remember, this is 2008. Mongrel was cutting edge Ruby technology. Phusion Passenger wasn’t released yet (more on this later), and Mongrel was light-years ahead of FastCGI. The problem with Mongrel? It’s single-threaded. It was deemed that the cost of shipping fast was more important than CPU efficiency, a choice I agreed with. We were one of the first products at LinkedIn to do i18n (well, we only did translations) via gettext. We deployed using Capistrano, and were the first ones to use nginx. We did a lot of other cool stuff, like experiment with Redis, learn a lot about memcached in production (nowadays this is a given, but there was a lot of memcached vs EHCache talk back then). Etc, etc. But I’m not trying to talk about how much fun I had on that team. Well, not primarily.

I’m here to clear up facts about the post about moving to node.js. And to do that, I’m going to back to my story.

The iPhone SDK had shipped around that time. We didn’t have an app ready for launch, but we wanted to build one, so our team did, and we inadvertantly became the mobile team. So suddenly, we decided that this array of Rails server that made API calls to PAL (which was, back then, using a pre-OAuth token exchange technology that was strikingly similar) would also be the primary API server for the iPhone client and any other rich mobile client we’d end up building, this thing that was basically using Rails rhtml templates. We upgraded to Rails 2.x+ so we could have the respond_to directive for different outputs. Why didn’t we connect the iPhone client directly to PAL? I don’t remember. Oh, and we also decided to use OAuth for authenticating the iPhone client. Three legged OAuth, so we also turned those Rails servers into OAuth providers. Why did we use 3-legged OAuth? Simple: we had no idea what we were doing. I’LL ADMIT IT.

Did I mention that we hosted outside the main data centers? This is what Joyent talks about when they say they supplied LinkedIn with hosting. They never hosted linkedin.com proper on Joyent, but we had a long provisioning process for getting servers in the primary data center, and there were these insane rules about no scripting languages in production, so we decided it was easier to adopt an outside provider when we needed more capacity.

Here’s what you were accessing if you were using the LinkedIn iPhone client:

iPhone -> m.linkedin.com (running on Rails) -> LinkedIn’s API (which, for all intents and purposes, only had one client, us)

That’s a cross data center request, guys. Running on single-threaded Rails servers (every request blocked the entire process), running Mongrel, leaking memory like a sieve (this was mostly the fault of gettext). The Rails server did some stuff, like translations, and transformation of XML to JSON, and we tested out some new mobile-only features on it, but beyond that it didn’t do a lot. It was a little more than a proxy. A proxy with a maximum concurrency factor dependent on how many single-threaded Mongrel servers we were running. The Mongrel(s), we affectionately referred to them, often bloated up to 300mb of RAM each, so we couldn’t run many of them.

At this time, I was busy productionizing JRuby. JRuby, you see was taking full advantage of Rails’ ability to serve concurrent requests using JVM concurrency. In addition, JRuby outperformed MRI in almost every real benchmark I threw at it – there were maybe 1 or 2 specific benchmarks when it didn’t. I knew that if we ported the mobile server to JRuby, we could have gotten more performance and gotten way more concurrency. We would have kept the same ability to deploy fast with the option to in-line into many of the Java libraries LinkedIn was using.

But we didn’t. Instead, the engineering manager at the time ruled in favor of Phusion Passenger, which, to be fair, was an easier port than JRuby. We had come to depend on various native extensions, gettext being the key one, and we didn’t have time to port the translations to something that was JRuby friendly. I was furious, of course, because I had been evangelizing JRuby as the best Ruby production environment and no one was listening, but that’s a different story for a different time. Well, maybe some people listened; those Square guys come to mind.

This was about the time I left LinkedIn. As far as I know, they didn’t build a ton more features. Someone told me that one of my old teammates suddenly became fascinated with node.js, and pretty much singlehandedly decided to rewrite the mobile server using node. Node was definitely a better fit for what we were doing, since we were constantly blocking on a cross data center call, and non blocking server for IO has been shown to be highly advantageous from a performance perspective. Not to mention: we never intended for the original Ruby on Rails server to be used as a proxy for several years.

So, knowing all the facts, what are all the takeaways?

  • Is v8 faster than MRI? MRI is generally slower than YARV (Ruby 1.9), and, at least in these benchmarks, I don’t think there is any question that v8 is freakin’ fast. If node.js blocked on I/O, however, this fact would have been almost completely irrelevant.
  • The rewrite factor. How many of us have been on a software engineering project where the end result looking nothing like what we planned to build in the first place? And, knowing fully the requirements, we know that, if given time and the opportunity to rebuild it from scratch, it would have been way better? Not to mention: I grew a lot at LinkedIn as a software engineer, so the same me several years later would have done a far better job than the same me in 2008. Experience does matter.
  • I see that one of the advantages of the mobile server being in node.js is people could “leverage” (LinkedIn loves that word) their Javascript skills. Well, LinkedIn had/has hundreds of Java engineers! If that was a concern, we would have spent more time exploring Netty. Lies, damn lies, and benchmarks, I always say, but I think it’s safe for us to say that Netty (this is vertx, which sits on top of Netty) is at least as fast as node.js for web serving.
  • Firefighting? That was probably a combination of several things: the fact that we were running MRI and leaked memory, or the fact that the ops team was 30% of a single guy.

What I’m saying here is use your brain. Don’t read the High Scalability post and assume that you must build your next technology using node.js. It was definitely a better fit than Ruby on Rails for what the mobile server ended up doing, but it is not a performance panacea. You’re comparing a lower level server to a full stack web framework.

That’s all for tonight, folks, and thank you internet for finally goading me out of hiding again.

– Ikai

Written by Ikai Lan

October 4, 2012 at 6:34 pm