Ikai Lan says

I say things!


Using Asynchronous URLFetch on Java App Engine


Developers building applications on top of Java App Engine can use the familiar java.net interface for making off-network calls. For simple requests, this should be more than sufficient. The low-level API, however, provides one feature not available in java.net: asynchronous URLFetch.

The low-level URLFetch API

So what does the low-level API look like? Something like this:

import java.io.IOException;
import java.net.URL;
import java.util.List;

import com.google.appengine.api.urlfetch.HTTPHeader;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

	protected void makeSyncRequest() {
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		try {
			URL url = new URL("http://someurl.com");
			HTTPResponse response = fetcher.fetch(url);

			byte[] content = response.getContent();

			// if redirects are followed, this returns the final URL we were redirected to
			URL finalUrl = response.getFinalUrl();

			// 200, 404, 500, etc
			int responseCode = response.getResponseCode();
			List<HTTPHeader> headers = response.getHeaders();

			for(HTTPHeader header : headers) {
				String headerName = header.getName();
				String headerValue = header.getValue();
			}

		} catch (IOException e) {
			// new URL throws MalformedURLException; fetch() itself can also throw IOException
		}
	}

The full Javadocs are here.

So it’s a bit different from the standard java.net interface, where we’d get back a reader and iterate line by line over the response. We’re also protected from blowing out the heap, because URLFetch responses are limited to 1MB.
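For comparison, here's roughly what the same request looks like with plain java.net, reading the response line by line. This is just a sketch for contrast (the URL is a placeholder), not code from the original post:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

	protected String makeJavaNetRequest() throws IOException {
		URL url = new URL("http://someurl.com");
		BufferedReader reader = new BufferedReader(
				new InputStreamReader(url.openStream()));
		StringBuilder body = new StringBuilder();
		String line;
		// Iterate over the response one line at a time
		while ((line = reader.readLine()) != null) {
			body.append(line).append("\n");
		}
		reader.close();
		return body.toString();
	}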

Asynchronous vs. Synchronous requests

Using java.net has the advantage of portability – you could build a standard fetcher that will work in any JVM environment, even those outside of App Engine. The tradeoff, of course, is that App Engine specific features won’t be present. The one killer feature of App Engine’s low-level API that isn’t present in java.net is asynchronous URLFetch. What is asynchronous fetch? Let’s make an analogy:

Let’s pretend you, like me at home, are on DSL and have a pretty pathetic downstream speed, and decide to check out a link sharing site like Digg. You browse to the front page and decide to open up every single link. You can do this synchronously or asynchronously.

Synchronously

Synchronously, you click link #1. Now you look at this page. When you are done looking at this page, you hit the back button and click link #2. Repeat until you have seen all the pages. Now, again, you are on DSL, so not only do you spend time reading each link, before you read each destination page, you have to wait for the page to load. This can take a significant portion of your time. The total amount of time you sit in front of your computer is thus:

N = number of links
L = loading time per page
R = reading time per page

N * (L + R)

(Yes, before I wrote this equation, I thought that including some mathematical formulas in my blog post would make me look smarter, but as it turns out the equation is something comprehensible by 8 year olds internationally/American public high school students)

Asynchronously

Using a tabbed browser, you control-click every single link on the page to open them in new tabs. Now before you go look at any of the pages, you decide to go to the kitchen and make a grilled cheese sandwich. When the sandwich is done, you come back to your computer and sit down, enjoying your nice, toasty sandwich while you read articles about Ron Paul and look at funny pictures of cats. How much time are you spending?

N = number of links
S = loading time for the slowest loading page
R = reading time per page
G = time to make a grilled cheese sandwich
MAX((N * R + G), (N * R + S))

Which takes longer, your DSL, or the time it takes you to make a grilled cheese sandwich? The point that I’m making here is that you can save time by parallelizing things. No, I know it’s not a perfect analogy, as downloading N pages in parallel consumes the same crappy DSL connection, but you get what I am trying to say. Hopefully. And maybe you are also in the mood for some cheese.

Asynchronous URLFetch in App Engine

So what would the URLFetch above look like asynchronously? Probably something like this:

import java.io.IOException;
import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

import com.google.appengine.api.urlfetch.HTTPHeader;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

	protected void makeAsyncRequest() {
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		try {
			URL url = new URL("http://someurl.com");
			Future<HTTPResponse> future = fetcher.fetchAsync(url);

			// Other stuff can happen here!

			HTTPResponse response = future.get();
			byte[] content = response.getContent();
			URL finalUrl = response.getFinalUrl();
			int responseCode = response.getResponseCode();
			List<HTTPHeader> headers = response.getHeaders();

			for(HTTPHeader header : headers) {
				String headerName = header.getName();
				String headerValue = header.getValue();
			}

		} catch (IOException e) {
			// new URL throws MalformedURLException, which is impossible for us here
		} catch (InterruptedException e) {
			// Thrown if we are interrupted while waiting on the java.util.concurrent.Future
		} catch (ExecutionException e) {
			// Wraps any exception thrown during the asynchronous fetch
			e.printStackTrace();
		}

	}

This looks pretty similar – EXCEPT: fetchAsync doesn’t return an HTTPResponse. It returns a Future. What is a future?

java.util.concurrent.Future

From the Javadocs:

“A Future represents the result of an asynchronous computation. Methods are provided to check if the computation is complete, to wait for its completion, and to retrieve the result of the computation. The result can only be retrieved using method get when the computation has completed, blocking if necessary until it is ready. Cancellation is performed by the cancel method. Additional methods are provided to determine if the task completed normally or was cancelled.”


What does this mean in English? It means that the Future object is NOT the response of the HTTP call. We can’t get the response until we call the get() method. Between when we call fetchAsync() and get(), we can do other stuff: datastore operations, inserting things into the Task Queue – heck, we can even do more URLFetch operations. When we finally DO call get(), one of two things happens:

  1. We’ve already retrieved the URL. Return an HTTPResponse object
  2. We’re still retrieving the URL. Block until we are done, then return an HTTPResponse object.

In the best case scenario, the other work we do takes at least as long as the URLFetch, so the fetch costs us no additional waiting at all. In the worst case scenario, the total time is the time it takes to do the URLFetch or the other operations, whichever takes longer. Either way, we lower the amount of time it takes to return a response to the end-user.
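To make the overlap concrete, here's a minimal sketch of the pattern (the FetchLog kind and the URL are placeholders I made up for illustration): kick off the fetch, do a datastore write while the fetch is in flight, and only then block on the result.

import java.net.URL;
import java.util.concurrent.Future;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

	protected byte[] fetchWhileDoingOtherWork() throws Exception {
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

		// Kick off the fetch; this returns immediately with a Future
		Future<HTTPResponse> future = fetcher.fetchAsync(new URL("http://someurl.com"));

		// Do unrelated work while the fetch is in flight, e.g. a datastore write
		Entity logEntry = new Entity("FetchLog");
		logEntry.setProperty("startedAt", System.currentTimeMillis());
		datastore.put(logEntry);

		// Now block (only if the fetch hasn't finished yet) and use the response
		HTTPResponse response = future.get();
		return response.getContent();
	}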

Twitter Example

So let’s build a servlet that retrieves my tweets. Just for giggles, let’s do it 20 times and see what the performance difference is. We’ll make it so that if we pass a URL parameter, async=true (or anything, for simplicity), we do the same operation using fetchAsync. The code is below:

package com.ikai.urlfetchdemo;

import java.io.IOException;
import java.io.PrintWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

@SuppressWarnings("serial")
public class GetTwitterFeedServlet extends HttpServlet {

	protected static String IKAI_TWITTER_RSS = "http://twitter.com/statuses/user_timeline/14437022.rss";

	public void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws IOException {

		boolean isSyncRequest = true;

		if(req.getParameter("async") != null) {
			isSyncRequest = false;
		}

		resp.setContentType("text/html");
		PrintWriter out = resp.getWriter();
		out.println("<h1>Twitter feed fetch demo</h1>");

		long startTime = System.currentTimeMillis();
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		URL url = new URL(IKAI_TWITTER_RSS);

		if(isSyncRequest) {
			out.println("<h2>Synchronous fetch</h2>");
			for(int i = 0; i < 20; i++) {
				HTTPResponse response = fetcher.fetch(url);
				printResponse(out, response);
			}
		} else {
			out.println("<h2>Asynchronous fetch</h2>");
			ArrayList<Future<HTTPResponse>> asyncResponses = new ArrayList<Future<HTTPResponse>>();
			for(int i = 0; i < 20; i++) {
				Future<HTTPResponse> responseFuture = fetcher.fetchAsync(url);
				asyncResponses.add(responseFuture);
			}

			for(Future<HTTPResponse> future : asyncResponses){
				HTTPResponse response;
				try {
					response = future.get();
					printResponse(out, response);
				} catch (InterruptedException e) {
					// Guess you would do something here
				} catch (ExecutionException e) {
					// Guess you would do something here
				}
			}

		}

		long totalProcessingTime = System.currentTimeMillis() - startTime;
		out.println("<p>Total processing time: " + totalProcessingTime + "ms</p>");
	}

	private void printResponse(PrintWriter out, HTTPResponse response) {
		out.println("<p>");
		out.println("Response: " + new String(response.getContent()));
		out.println("</p>");
	}

}

As you can see, it’s a bit more involved: we store all the Futures in a list, then iterate through them. We’re also not being too intelligent about iterating through the futures: we’re assuming first-in-first-out (FIFO) behavior from URLFetch, which may or may not be the case in production. A more optimized version might fetch the response from a call we know is faster before fetching from one we know is slower, as sketched below. However, empirical testing will show that more often than not, doing things asynchronously is significantly faster for the user than doing them synchronously.
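For what it's worth, here's one simple way to consume responses as they complete rather than in submission order: poll Future.isDone() and handle whichever fetches have already finished, falling back to blocking on the oldest one. This sketch is only an illustration of the idea, not code from the demo project:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

import com.google.appengine.api.urlfetch.HTTPResponse;

	protected void handleAsTheyComplete(List<Future<HTTPResponse>> futures)
			throws InterruptedException, ExecutionException {
		List<Future<HTTPResponse>> pending = new ArrayList<Future<HTTPResponse>>(futures);
		while (!pending.isEmpty()) {
			Future<HTTPResponse> ready = null;
			// Prefer a fetch that has already finished
			for (Future<HTTPResponse> future : pending) {
				if (future.isDone()) {
					ready = future;
					break;
				}
			}
			if (ready == null) {
				// Nothing has finished yet; block on the oldest outstanding fetch
				ready = pending.get(0);
			}
			HTTPResponse response = ready.get();
			pending.remove(ready);
			// ... do something with the response, e.g. write it to the output
		}
	}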

Using Asynchronous URLFetch and HTTP POST

So far, our examples have been focused on read operations. What if we don’t care about the response? For instance, what if we decide to make an API call that essentially is a “write” operation, and can, for the most part, safely assume it will succeed?

// JavaAsyncUrlFetchDemoServlet.java
package com.ikai.urlfetchdemo;

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

@SuppressWarnings("serial")
public class JavaAsyncUrlFetchDemoServlet extends HttpServlet {

	public void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws IOException {

		long startTime = System.currentTimeMillis();
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		fetcher.fetchAsync(FetchHelper.makeGuestbookPostRequest("Async", "At" + startTime));
		long totalProcessingTime = System.currentTimeMillis() - startTime;

		resp.setContentType("text/html");
		resp.getWriter().println("<h1>Asynchronous fetch demo</h1>");
		resp.getWriter().println("<p>Total processing time: " + totalProcessingTime + "ms</p>");
	}

}
// FetchHelper.java
package com.ikai.urlfetchdemo;

import java.net.MalformedURLException;
import java.net.URL;

import com.google.appengine.api.urlfetch.HTTPMethod;
import com.google.appengine.api.urlfetch.HTTPRequest;

public class FetchHelper {

	protected static final String signGuestBookUrl = "http://bootcamp-demo.appspot.com/sign";

	public static HTTPRequest makeGuestbookPostRequest(String name, String content){
		HTTPRequest request = null;
		URL url;
		try {
			url = new URL(signGuestBookUrl);
			request = new HTTPRequest(url, HTTPMethod.POST);
			String body = "name=" + name + "&content=" + content;
			request.setPayload(body.getBytes());

		} catch (MalformedURLException e) {
			// Do nothing
		}
		return request;
	}
}

I’ve decided to spam my own guestbook here, for better or for worse.

Download the code

You can download the code from this post here using git: http://github.com/ikai/Java-App-Engine-Async-URLFetch-Demo

Written by Ikai Lan

June 29, 2010 at 2:49 pm


Using the bulkloader with Java App Engine


The latest release of the datastore bulkloader greatly simplifies import and export of data from App Engine applications for developers. We’ll go through a step by step example of using this tool with a Java application. Note that only setting up Remote API is Java specific – everything else works the same way for Python applications. Unlike certain phone companies, this is one store that doesn’t care what language your application is written in.

Checking for our Prerequisites:

If you already have Python 2.5.x and the Python SDK installed, skip this section.

First off, we’ll need to download the Python SDK. This example assumes we have Python version 2.5.x installed. If you’re not sure what version you have installed, open up a terminal and type “python”. This opens up a Python REPL, with the first line displaying the version of Python you’re using. Here’s example output:

Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

(Yes, Pythonistas, the version on my laptop is ooooooooold).

Download the Python SDK from the following link. As of the writing of this post, the newest version is 1.3.4: Direct link.

Unzip this file. It’ll be easier for you if you place this in your path. Linux and OS X users will append this to their ~/.bash_profile:

PATH="/path/to/where/you/unzipped/appengine:${PATH}"
export PATH

To test that everything is working, type

appcfg.py

You’ll see a page of usage commands that starts out like this:

Usage: appcfg.py [options] <action>

Action must be one of:
create_bulkloader_config: Create a bulkloader.yaml from a running application.
cron_info: Display information about cron jobs.
download_data: Download entities from datastore.
help: Print help for a specific action.
request_logs: Write request logs in Apache common log format.
rollback: Rollback an in-progress update.
update: Create or update an app version.
update_cron: Update application cron definitions.
update_dos: Update application dos definitions.
update_indexes: Update application indexes.
update_queues: Update application task queue definitions.
upload_data: Upload data records to datastore.
vacuum_indexes: Delete unused indexes from application.
Use 'help <action>' for a detailed description.

…. (and so forth)

Now we can go ahead and start using the bulkloader.

Using the bulkloader for Java applications

Before we can begin using the bulkloader, we’ll have to set it up. Setting up the bulkloader is a two step process. We’ll need to:

1. Add RemoteApi to our URI mapping
2. Generate a bulkloader configuration

Step 1: Add RemoteApi to our URI mapping

We’ll want to edit our web.xml. Add the following lines:

    <servlet>
        <servlet-name>RemoteApi</servlet-name>
        <servlet-class>com.google.apphosting.utils.remoteapi.RemoteApiServlet</servlet-class>
    </servlet>

    <servlet-mapping>
        <servlet-name>RemoteApi</servlet-name>
        <url-pattern>/remote_api</url-pattern>
    </servlet-mapping>

A common pitfall with setting up RemoteApi is that developers using frameworks will use a catch-all expression for mapping URIs, and this will stomp over our servlet-mapping. Deploy this application to production. We’ll likely also want to put an admin constraint on this URL.

Step 2: Generate a bulkloader configuration

This step isn’t actually *required*, but it certainly makes our lives easier, especially if we are looking to export existing data. In a brand new application, if we are looking to bootstrap our application with data, we don’t need this step at all. For completeness, however, it’d be best to go over it.

We’ll need to generate a configuration template. This step depends on datastore statistics having been updated with the Entities we’re looking to export. Log in to appspot.com and click “Datastore Statistics” under Datastore in the right hand menu.

If we see something that looks like the following screenshot, we can use this tool.

If we see something that looks like the screenshot below, then we can’t autogenerate a configuration since this is a brand new application – that’s okay; it means we probably don’t have much data to export. We’ll have to wait for App Engine’s background tasks to bulk update our statistics before we’ll be able to complete this step.

Assuming that we have datastore statistics available, we can use appcfg.py in the following manner to generate a configuration file:

appcfg.py create_bulkloader_config --url=http://APPID.appspot.com/remote_api --application=APPID --filename=config.yml

If the datastore statistics aren’t ready yet, running this command will cause the following error:

[ERROR   ] Unable to download kind stats for all-kinds download.
[ERROR   ] Kind stats are generated periodically by the appserver
[ERROR   ] Kind stats are not available on dev_appserver.

I’m using this on a Guestbook sample application I wrote for a codelab a while ago. The only Entities are Greetings, each of which consists of a String name, a String content and a date. This is what my config file looks like:

# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
#  * Fill in connector and connector_options
#  * Review the property_map.
#    - Ensure the 'external_name' matches the name of your CSV column,
#      XML tag, etc.
#    - Check that __key__ property is what you want. Its value will become
#      the key name on import, and on export the value will be the Key
#      object.  If you would like automatic key generation on import and
#      omitting the key on export, you can remove the entire __key__
#      property from the property map.

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Greeting
  connector:

  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      import_transform: transform.key_id_or_name_as_string

    - property: content
      external_name: content
      # Type: String Stats: 7 properties of this type in this kind.

    - property: date
      external_name: date
      # Type: Date/Time Stats: 7 properties of this type in this kind.
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')

    - property: name
      external_name: name
      # Type: String Stats: 7 properties of this type in this kind.

We care about the connector. Replace that with the following:

- kind: Greeting
  connector: csv

We’ve only filled in the “connector” option. Now we have something we can use to dump data.

Examples of common usages of the bulkloader

Downloading data

We’ve got what we need to dump data. Let’s go ahead and do that now. Issue the following command:

appcfg.py download_data --config_file=config.yml --filename=data.csv --kind=Greeting --url=http://APPID.appspot.com/remote_api --application=APPID

We’ll be asked to provide our email and password credentials. Here’s what my console output looks like:

Downloading data records.
[INFO    ] Logging to bulkloader-log-20100609.162353
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 10
[INFO    ] Opening database: bulkloader-progress-20100609.162353.sql3
[INFO    ] Opening database: bulkloader-results-20100609.162353.sql3
[INFO    ] Connecting to java.latest.bootcamp-demo.appspot.com/remote_api
2010-06-09 16:23:57,022 WARNING appengine_rpc.py:399 ssl module not found.
Without the ssl module, the identity of the remote host cannot be verified, and
connections may NOT be secure. To fix this, please install the ssl module from
http://pypi.python.org/pypi/ssl .
To learn more, see http://code.google.com/appengine/kb/general.html#rpcssl .
Please enter login credentials for java.latest.bootcamp-demo.appspot.com
Email: YOUR EMAIL
Password for YOUR_EMAIL:
[INFO    ] Downloading kinds: ['Greeting']
.[INFO    ] Greeting: No descending index on __key__, performing serial download
.
[INFO    ] Have 17 entities, 0 previously transferred
[INFO    ] 17 entities (11304 bytes) transferred in 10.5 seconds

There’s now a CSV file named data.csv in my directory, as well as a bunch of autogenerated bulkloader-* files for resuming if the loader dies midway during the export. My CSV file starts like this:

content,date,name,key
Hey it works!,2010-05-18T22:35:17,Ikai Lan,1
… (More lines of CSV)

The first line is a labeling line – this line designates the order in which properties have been exported. In our case, we’ve exported content, date and name in addition to Entity keys.

Uploading Data

To upload the CSV file back into the datastore, we run the following command:

appcfg.py upload_data --config_file=config.yml --filename=data.csv --url=http://APPID.appspot.com/remote_api --application=APPID --kind=Greeting

This’ll use config.yml and create our entities in the remote datastore.

Adding a new field to datastore entities

One question that is frequently asked in the groups is, “How do I migrate my schema?” This question is generally poorly phrased; App Engine’s datastore is schemaless. That is – it is possible to have Entities of the same Kind with completely different sets of properties. Most of the time, this is a good thing. MySQL, for instance, requires a table lock to do a schema update. By being schema free, migrations can happen lazily, and application developers can check at runtime for whether a Property exists on a given Entity, then create or set the value as needed.
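With the low-level Java datastore API, that runtime check is just a call to Entity.hasProperty(). Here's a minimal sketch of the lazy approach, using the homepageUrl field and default value that the rest of this post uses as its example (the method name is mine):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;

	// Backfills the missing property on a single entity as it is read
	public Entity getGreeting(Key key) throws EntityNotFoundException {
		DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
		Entity greeting = datastore.get(key);
		if (!greeting.hasProperty("homepageUrl")) {
			greeting.setProperty("homepageUrl", "http://www.google.com");
			datastore.put(greeting);
		}
		return greeting;
	}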

But there are times when this isn’t sufficient. One use case is if we want to change a default value on Entities and grandfather older Entities to the new default value, but we also want the default value to possibly be null. We can do tricks such as creating a new Property, setting an update timestamp, checking whether the update timestamp is before or after when we made the code change and updating conditionally, and so forth. The problem with this approach is that it introduces a TON of complexity into our application, and if we have more than one of these “migrations”, suddenly we’re writing more code to lazily grandfather data and confusing the non-Cylons that work on our team. It’s easier to migrate all the data. So how do we do this? Before the new application code goes live, we migrate the schema by adding the new field. The best part about this is that we can do it without locking tables, so writes can continue.

Let’s add a new String field to our Greeting class: homepageUrl. Let’s assume that we want to set a default to http://www.google.com. How would we do this? Let’s update our config.yml file to the following:

# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
#  * Fill in connector and connector_options
#  * Review the property_map.
#    - Ensure the 'external_name' matches the name of your CSV column,
#      XML tag, etc.
#    - Check that __key__ property is what you want. Its value will become
#      the key name on import, and on export the value will be the Key
#      object.  If you would like automatic key generation on import and
#      omitting the key on export, you can remove the entire __key__
#      property from the property map.

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Greeting
  connector: csv

  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      import_transform: transform.key_id_or_name_as_string

    - property: content
      external_name: content
      # Type: String Stats: 7 properties of this type in this kind.

    - property: homepageUrl
      external_name: homepageUrl

    - property: date
      external_name: date
      # Type: Date/Time Stats: 7 properties of this type in this kind.
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')

    - property: name
      external_name: name
      # Type: String Stats: 7 properties of this type in this kind.

Note that we’ve added a new property with a new external_name. By default, the loader will use a String.

Now let’s add the field to our CSV file:

content,date,name,key,homepageUrl
Hey it works!,2010-05-18T22:35:17,Ikai Lan,1,http://www.google.com
... (more lines)

We’d likely write a script to augment our CSV file. Note that this only works if we have named keys! If we had integer keys before, we’ll end up creating duplicate entities using key names and not integer IDs.

Now we run the bulkloader to upload our entities:

appcfg.py upload_data --config_file=config.yml --filename=data.csv --url=http://APPID.appspot.com/remote_api --application=APPID --kind=Greeting

Once our loader has finished running, we’ll see the new fields on our existing entities.

WARNING: There is a potential race condition here: if an Entity gets updated by our bulkloader in this fashion right as user facing code reads and updates the Entity without the new field, that will leave us with Entities that were grandfathered incorrectly. Fortunately, after we migrate, we can do a query for these Entities and manually update them. It’s slightly annoying, but far less painful than making bulkloader updates transactional.
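A cleanup pass for those stragglers might look something like the sketch below. It's illustrative only: the datastore can't filter on a missing property, so we sweep the Kind and backfill whatever the race condition left behind.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;

	// Sweeps all Greeting entities and fills in the new field where it is missing
	public void backfillHomepageUrl() {
		DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
		Query query = new Query("Greeting");
		for (Entity greeting : datastore.prepare(query).asIterable()) {
			if (!greeting.hasProperty("homepageUrl")) {
				greeting.setProperty("homepageUrl", "http://www.google.com");
				datastore.put(greeting);
			}
		}
	}

For anything beyond a small Kind, we'd want to run a sweep like this from the Task Queue or in batches so it stays within request deadlines.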

Bootstrapping the datastore with default Entities

So we’ve covered the use case of using a generated config.yml file to update or load entities into the datastore, but what we haven’t yet covered is bootstrapping a completely new Entity Kind with never before seen data into the datastore.

Let’s add a new Entity Kind, Employee, to our datastore. We’ll preload this data:

name,title
Ikai Lan,Developer Programs Engineer
Patrick Chanezon,Developer Advocate
Wesley Chun,Developer Programs Engineer
Nick Johnson,Developer Programs Engineer
Jason Cooper,Developer Programs Engineer
Christian Schalk,Developer Advocate
Fred Sauer,Developer Advocate

Note that we didn’t add a key. In this case, we don’t care, so it simplifies our config files. Now let’s take a look at the config.yml we need to use:

python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Employee
  connector: csv

  property_map:

    - property: name
      external_name: name

    - property: title
      external_name: title

Now let’s go ahead and upload these entities:

$ appcfg.py upload_data --config_file=new_entity.yml --filename=new_entity.csv  --url=http://APPID.appspot.com/remote_api --kind=Employee
Uploading data records.
[INFO    ] Logging to bulkloader-log-20100610.151326
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 10
[INFO    ] Opening database: bulkloader-progress-20100610.151326.sql3
[INFO    ] Connecting to APPID.appspot.com/remote_api
2010-06-10 15:13:27,334 WARNING appengine_rpc.py:399 ssl module not found.
Without the ssl module, the identity of the remote host cannot be verified, and
connections may NOT be secure. To fix this, please install the ssl module from
http://pypi.python.org/pypi/ssl .
To learn more, see http://code.google.com/appengine/kb/general.html#rpcssl .
Please enter login credentials for APPID.appspot.com
Email: your.email@gmail.com
Password for your.email@gmail.com:
[INFO    ] Starting import; maximum 10 entities per post
.
[INFO    ] 7 entites total, 0 previously transferred
[INFO    ] 7 entities (5394 bytes) transferred in 8.6 seconds
[INFO    ] All entities successfully transferred

Boom! We’re done.

There are still a lot of bulkloader topics to discuss – related entities, entity groups, keys, and so forth. Stay tuned.

Written by Ikai Lan

June 10, 2010 at 2:52 pm


Lucene In-Memory Search Example: Now updated for Lucene 3.0.1


Update: Here’s a link to some sample code for Python using PyLucene. Thanks, Joseph!

While playing around with Lucene in my experiments to make it work with Google App Engine, I found an excellent example for indexing some text using Lucene in-memory; unfortunately, it dates back to May 2004 (!!!). I’ve updated the example to work with the newest version of Lucene, 3.0.1. It’s below for reference.

The Pastie link for the code snippet can be found here.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import java.io.IOException;

public class LuceneTest{

   public static void main(String[] args) {
      // Construct a RAMDirectory to hold the in-memory representation
      // of the index.
      RAMDirectory idx = new RAMDirectory();

      try {
         // Make a writer to create the index
         IndexWriter writer =
                 new IndexWriter(idx, 
                         new StandardAnalyzer(Version.LUCENE_30), 
                         IndexWriter.MaxFieldLength.LIMITED);

         // Add some Document objects containing quotes
         writer.addDocument(createDocument("Theodore Roosevelt",
                 "It behooves every man to remember that the work of the " +
                         "critic, is of altogether secondary importance, and that, " +
                         "in the end, progress is accomplished by the man who does " +
                         "things."));
         writer.addDocument(createDocument("Friedrich Hayek",
                 "The case for individual freedom rests largely on the " +
                         "recognition of the inevitable and universal ignorance " +
                         "of all of us concerning a great many of the factors on " +
                         "which the achievements of our ends and welfare depend."));
         writer.addDocument(createDocument("Ayn Rand",
                 "There is nothing to take a man's freedom away from " +
                         "him, save other men. To be free, a man must be free " +
                         "of his brothers."));
         writer.addDocument(createDocument("Mohandas Gandhi",
                 "Freedom is not worth having if it does not connote " +
                         "freedom to err."));

         // Optimize and close the writer to finish building the index
         writer.optimize();
         writer.close();

         // Build an IndexSearcher using the in-memory index
         Searcher searcher = new IndexSearcher(idx);

         // Run some queries
         search(searcher, "freedom");
         search(searcher, "free");
         search(searcher, "progress or achievements");

         searcher.close();
      }
      catch (IOException ioe) {
         // In this example we aren't really doing an I/O, so this
         // exception should never actually be thrown.
         ioe.printStackTrace();
      }
      catch (ParseException pe) {
         pe.printStackTrace();
      }
   }

   /**
    * Make a Document object with an un-indexed title field and an
    * indexed content field.
    */
   private static Document createDocument(String title, String content) {
      Document doc = new Document();

      // Add the title as an unindexed field...

      doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));


      // ...and the content as an indexed field. Field.Index.ANALYZED runs
      // the content through the analyzer so the text becomes searchable,
      // while Field.Store.YES also stores the original string in the index
      // so we can print it back out when displaying results.
      doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));

      return doc;
   }

   /**
    * Searches for the given string in the "content" field
    */
   private static void search(Searcher searcher, String queryString)
           throws ParseException, IOException {

      // Build a Query object
      QueryParser parser = new QueryParser(Version.LUCENE_30, 
              "content", 
              new StandardAnalyzer(Version.LUCENE_30));
      Query query = parser.parse(queryString);


      int hitsPerPage = 10;
      // Search for the query
      TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, false);
      searcher.search(query, collector);

      ScoreDoc[] hits = collector.topDocs().scoreDocs;

      int hitCount = collector.getTotalHits();
      System.out.println(hitCount + " total matching documents");

      // Examine the Hits object to see if there were any matches

      if (hitCount == 0) {
         System.out.println(
                 "No matches were found for \"" + queryString + "\"");
      } else {
         System.out.println("Hits for \"" +
                 queryString + "\" were found in quotes by:");

         // Iterate over the Documents in the Hits object
         for (int i = 0; i < hits.length; i++) {
            ScoreDoc scoreDoc = hits[i];
            int docId = scoreDoc.doc;
            float docScore = scoreDoc.score;
            System.out.println("docId: " + docId + "\t" + "docScore: " + docScore);

            Document doc = searcher.doc(docId);

            // Print the value that we stored in the "title" field. Note
            // that this Field was not indexed, but (unlike the
            // "contents" field) was stored verbatim and can be
            // retrieved.
            System.out.println("  " + (i + 1) + ". " + doc.get("title"));
            System.out.println("Content: " + doc.get("content"));            
         }
      }
      System.out.println();
   }
}

In progress: still trying to figure out how to get some version of Lucene working on App Engine for Java. My thoughts:

  • Use an In Memory index
  • Serialize to Memcache or the Datastore (not even sure how to do this right now)

Granted, there are limitations to this: if an App Engine application exceeds some memory limit, a SoftMemoryExceeded exception will be thrown. Also – I’m doubtful of the ability to update indexes incrementally in the datastore, not to mention there’s a 1MB limit on datastore entities. The Blobstore, accessed programmatically, may not have the latency required. Still – it’s an interesting thought experiment, and there’s probably some compromise we can find with a future feature of App Engine that’ll allow us to make Lucene actually usable. We just have to think of it. Stay tuned. I’ll write another post if I can get even a proof-of-concept to work.

Written by Ikai Lan

April 24, 2010 at 8:32 am

How I use social media


I’m often asked how I have time to use all the services I do such as Twitter, Facebook, LinkedIn, and so forth. I just do – it really doesn’t take up that much of my time, and I find that each of these services provides unique value to me, both professionally and socially. No, this isn’t a “thou shalt” type of article. Rather, it’s a breakdown of how I use the various social media outlets.

Facebook

I recently did a massive purge of my friends list. I had somewhere around 700 friends, and I sheared that down to a little over 200. I was adding folks with abandon for a while – if I met you once face to face, I clicked “Accept” when that ever so exciting “New Friend Request” notification fell into my inbox. No more. I’ve sheared it down to people I actually talk to, who talk to me, or who somehow interact with my Facebook page. Facebook is where I go to post pictures of myself wearing a fake mustache, or in a giant banana suit trying to negotiate a pose with a guy dressed up as Wolverine, or a video of my rendition of the Whitney Houston classic, “I have nothing.” Facebook is where I go to be stupid. Every once in a while, I’ll receive a business inquiry or friend request from someone I’ve met at a conference or in a professional context online. I’ll promptly decline.

LinkedIn

A quick disclaimer: I used to work at LinkedIn. While I was there, I loved working there and I loved the vision of the site. My entire engineering and design team that I worked with has since left, so I can’t attest to how it still is there (though there are still some cool people I know and talk to now and then).

LinkedIn is where I maintain a profile and look folks up several times a week. I used to check LinkedIn on my mobile device once a week at least to look for updates to folks I’ve worked with or had meaningful professional exchanges with, but in recent months the Twitter integration has made it too damned chatty and this feature isn’t useful for me anymore (I’ve expressed how I feel about this in a recent Tweet). For me, LinkedIn is for people search.

Twitter

Twitter has replaced LinkedIn as my weapon of choice for keeping up with my professional contacts. Nowadays, when I give a presentation or meet someone in the technical field, chances are, they have a Twitter account. I’ll tell them to follow me, and in most cases, I’ll reciprocate. Twitter has the entire ecosystem for me to use my tools on my desktop (Tweetdeck, Seesmic web) and on my mobile device (Tweetie 2 for iPhone, Seesmic for Android) to exchange quick messages and share interesting content. I don’t follow celebrities, and in most cases, I don’t follow my friends since many of them do not use Twitter professionally and cross post to both Facebook and Twitter anyway.

I’ve also met lots of interesting developers via Twitter. I have search columns in Tweetdeck and in my Seesmic Web session for terms I care about: scala, jruby, ruby, python, app engine, java, clojure. When I see someone tweeting interesting content, I will follow them and try to engage them. Most Twitter users are open to meetings at conferences or coffee/beer when they or you come to town. This is an incredibly unique feature of Twitter that allows me to find professionals by what they are currently working on or interested in that I can’t currently get anywhere else. If you work in technology, get on Twitter and start connecting.

Gowalla

I prefer Gowalla to Foursquare in spite of all the hype for the latter. Why? Gowalla has people I actually know adding me. Too many folks on Foursquare are people following me on Twitter that I have never met (very few of the folks I have met on Twitter have added me on Foursquare). There’s an argument here that it helps them find me. I don’t buy it. We’ll meet on terms we agree on, not when you find out I am near you and decide to pop in (the internet is still full of crazies). Gowalla has a nicer app for both iPhone and Android from a visual standpoint and from the fact that they actually figure out in the top three or so results where I am. On Foursquare, the place where I am at is too buried, and if it doesn’t exist, it’s too hard to add it – you have to add an address, city, zipcode, etc. On Gowalla this is a two step process.

So how useful is Gowalla to me? Marginal. If it disappeared tomorrow, nothing would change in my life. This is in stark contrast to Facebook, Twitter and LinkedIn. Gowalla’s primary value proposition to me is checking in some place cool and posting quickly to Facebook so my friends can make silly comments.

Google Buzz

Disclaimer #2: I work for Google now. We had been using Buzz internally for months before it launched, and it is super useful to be able to look at all the things members of your team are doing. This is what Yammer has been trying to accomplish. I’ve heard from distributed teams that use Yammer that it facilitates greater teamwork and collaboration. I definitely feel this way about internal Buzz.

The public Buzz is pretty cool, but hasn’t caught on among as many of my friends as I would have liked. I’ve unwired my Twitter connections because I already see them on Twitter. Buzz has a better discussion mechanism than Facebook, LinkedIn, Twitter – basically all of those, because it bubbles recently commented items to the top. I use Buzz pretty frequently to discuss local sports and video games with a circle of about ten people. I pity the folks who follow me and don’t care about these topics.

Yelp

When I first created my Yelp account in 2006 or so, I wrote about 60 reviews in the first week. It was loads of fun to meet other local folks! Nowadays, there are so many reviews that I feel like reviews don’t matter anymore, and I haven’t written in ages. To fill that need, I’ve started guest blogging for The Culture Bite, a food blog run by a friend of mine. I still use Yelp to find reviews for restaurants, though I take them with a grain of salt because folks will downrate a place for different reasons. Some may downrate for mandatory 18% gratuity (lots of places do this, people), loud patrons, a poor parking situation, restricted hours, and so forth. People have different ideas of what should be considered part of the dining experience, and this makes me put less weight on the rating system.

Written by Ikai Lan

March 25, 2010 at 8:38 pm


GoRuCo 2009 impressions


I passed on RailsConf this year, mostly for scheduling reasons. Instead, I attended the Gotham Ruby Conference (GoRuCo) this past Saturday at Pace University at NYC. Overall, I was very impressed with the caliber of attendees and would go again.

While all of the talks were outstanding, there were a few that I felt really stood out:

Where is Ruby headed?

Gregory Brown, creator of the Prawn PDF generation library for Ruby, funded completely by the community.

This was the first talk of the day. I didn’t sleep particularly well the night before and was a bit worried that I wouldn’t be ready to get anything out of the talk, but I’m happy to say that I was wrong. Gregory addressed several of the issues he had with making sure that Prawn would be compatible with both 1.8.6 and 1.9.1. He addressed various topics from very specific code samples of 1.8.6 and 1.9.1 differences to various workarounds for code that needs to be compatible with both versions (I/O iterators … #each is now #each_line). Eventually, this led to a few interesting discussions:

  • The Ruby versioning system is EXTREMELY confusing. 1.8.7 is not a *minor* update, but a rather significant one, and 1.9.1 is way more mature than 1.9
  • The general consensus is to not try to support 1.8.7, and to focus on 1.8.6 and 1.9.1. Eric Hodel, maintainer of RubyGems, will be moving towards only supporting 1.9.1+ in the coming months, but it’ll be a slow process.
  • The best way to include external C libraries in Ruby on Windows is FFI with JRuby. It’s a mad, mad, mad world.

SOLID Object Oriented Design

Sandi Metz

The slides are here: http://skmetz.home.mindspring.com/img1.html

I thought this was an amazing talk. Sandi discussed the methodology behind refactoring Object Oriented code. In this case, she discussed the design of a Ruby FTP client that would download a CSV file and do some trivial computation.

However, I felt that Sandi’s example is purposely hypothetical and meant to illustrate refactoring techniques. Realistically, for something as simple as an FTP client, the example was an exercise in over-engineering. There’s a balance between extensibility/modularity and creating a simple to use interface. It’s clear that Sandi is leaning towards the former, as it got to a point where she was evaluating a class name from a YAML file (RED FLAG, RED FLAG). I disagree with this approach, since Ruby can be so terse and powerful that in many cases, code should be configuration, especially if the configuration is for specifying a class name for the purposes of dependency injection. This is not the Ruby way, and there were comments that we had essentially turned the FTP client into something that was all about configuration and not convention, and it eventually leads to death by XML. Use judgment. Sandi’s techniques are great for refactoring complex interactions, but as Rubyists we understand that not everything is an object. Make the code readable for humans, and make the APIs clean. I don’t know if I agree with Sandi’s philosophy that “since you don’t know how your code will be used, it should be as modular as possible”. That sounds to me like front-loading engineering. I’m going to quote Reid Hoffman: “If you are not embarrassed by the first version of your product, you’ve launched too late.” Be pragmatic.

From Rails to Rack: Making Rails 3 a Better Ruby Citizen

Yehuda Katz, Project Lead for Merb, Engine Yard

Yehuda discussed a few of the major refactorings of Rails taking place for version 3. It’s all the type of stuff you hear is great about Merb: Rack-ability (the ability to mount Rack apps inside Rails), ORM agnosticism and view agnosticism. What has me really excited, however, is the unobtrusive JavaScript approach for RJS helpers. When I became proficient at JavaScript, I stopped using RJS not because it didn’t give me enough options, but because I found that inlining JavaScript had a tendency to get in the way.

So did I miss anything? Please comment if I did!

Written by Ikai Lan

June 2, 2009 at 11:15 am


Toeing the line: My take on the GoGaRuCo presentation fiasco


No, I’m not going to further whip Matt Aimonetti about his poorly chosen theme for his talk about CouchDB at GoGaRuCo this year. I think that enough has been said about the topic by far more influential members of the community than myself (there are many more links). Matt Aimonetti has already apologized, and I believe that he is genuinely apologetic. I will disagree, however, with Sarah Mei’s assertion that Rails is still a ghetto. If anything, the fact that we have been able to discuss this topic in such depth shows that the Ruby diaspora is not a group of antisocial hackers. We’re not a ghetto; we’re a real community that cares not only for the quality of our craft, but for being responsible craftsmen that care about having a positive impact on society through our passion and professionalism.

What I haven’t seen a lot of, however, is how we can benefit from what has already happened. We can’t go back and stuff the proverbial cat back in the bag. What’s done is done, and in order to move on we need to learn from our mistakes. I can’t rationalize why Matt Aimonetti did what he did better than he can, but I can tell an embarrassing story about myself making a very similar mistake. “All this has happened before, and all this will happen again,” I believe the adage goes:

My sophomore year in college, I took a job working in Housing IT to pay the bills. I was majoring in Computer Science and had to start somewhere; prior to the job I had worked doing data entry and filing clerk type duties. It was a huge step up for me. I was 19, pretty good with computers, and overall excited to finally have a real tech job.

For whatever reason, at the time, SJSU Housing IT’s leadership, in all their wisdom, decided that every student that would be using the residential network (ResNet) needed to have their NIC’s MAC address registered and a static IP configured. I suppose there were liability reasons, or the rules were created by someone with no knowledge of how modern networks function, or whatever. The point is that it was a particularly arduous job to go around to every student’s computer and, no matter how old or how messed up their copies of Windows 98 and Windows ME were, to get them up and running with internet access. Housing IT hired 5 of us students to serve as PARCs, short for Peer Advisors for Residential Computing, to deal with the management of this process, and when the initial configuration was done, to slip into more Helpdesk like positions. We had these gigantic binders of printouts of MAC addresses, their assigned Ethernet port number, and their IP address, and the first few weeks of school, I spent my time buried in Windows TCP/IP configuration hell (random note: a semester after I left, one ambitious student set up DHCP … and I said to him, “You idiot, you just put all the remaining PARCs out of work” in jest).

The challenge we had was information propagation and telling students the process for signing up for an internet configuration appointment. You have to remember what life was like before the age of ubiquitous Web 2.0 for college students. You got information out by posting fliers that no one read and making announcements that no one listened to. And of course, we couldn’t post information on the website or email students, because, well, they didn’t have an internet connection.

So being 19, ambitious and a bit of a maverick, I took it upon myself to start a bit of what resembled a viral marketing campaign. I took pictures of all the different PARCs and created posters that would have what I thought were funny taglines, a picture of a PARC, and a short description of the process of signing up for internet. The taglines were pretty random. I don’t remember most of them, but they were along these lines:

GOT MILK?
BE MY FRIEND!
DO YOU LIKE MY HAIR?

In my mind, it was pretty hilarious, since housing seemed to make it a point to take terribly unflattering pictures of us wearing branded shirts in the only size they could afford: XL. Ah, public school.

I saved the edgy one for myself, since, hey, I’m an edgy guy. I had a grin from cheek to cheek and blinked during the photo. Here’s what I put above my face:

DOES THIS LOOK LIKE A CHILD ABDUCTOR?

It was funny, right? Right? I mean, I laughed when I was making it.

Imagine my surprise when the excrement hit the fan. I had raised a furor, and it came All The Way Down The Mountain from the desk of the president of housing through several layers of management, and finally my boss, who called everyone on my team to ask who the culprit was. I quickly fessed up, and even though I was a little bit confused – it was only a joke – I was summoned to his office, where he asked me about the fliers, holding one in hand.

“What part of you thought this was a good idea?” He asked. He was usually a pretty jovial guy. He was not smiling.

I didn’t know, and I told him. But I didn’t try to defend why I did it. In fact, at this point, part of me still thought I was in trouble for unofficially posting fliers that hadn’t been signed off (and rightly so) and not because of the flier with my face on it subtly implying that I was some kind of a sex offender.

“No,” my boss said to me. “These other ones are fine. You’re a smart guy, what the hell could compel you to make a flier like this? Don’t you realize that many of these parents are first time college parents, and this is the first time their children are away from home?” He paused and looked me in the eye because he wanted the next point to really stick, “Don’t you realize that we might fire you?”

Wow. I slunk down in my chair. My first real day working my first job I cared about, and I was on the verge of losing it. Looking back, I sometimes wonder if my shame came more from the fact that I almost lost my job, or more from the fact that I could do something so boneheaded and not realize how people would react. Luckily, I didn’t lose my job, though I spent a lot of time hunting down fliers and apologizing to people. And configuring &%&*@* Windows ME TCP/IP settings, of course.

In retrospect, I’m glad the whole mess took place, because I learned a lot about being edgy, and the seminar we had prior to student check-in about respecting a diversity of opinions really sunk in. I haven’t forgotten them since, and I don’t think I ever will. I learned that even though I am, myself, someone who likes to make jokes and be edgy and push the envelope, not everyone else is. The same traits that can make me interesting and fun to be around also have the potential to make me obnoxious, offensive, and to create a bad perception of myself in the eyes of people who don’t really know me yet. The easiest lesson to extract here is to simply err on the side of caution when communicating with people who don’t know you, but I’d like to think it goes much deeper than that. I don’t think we give ourselves enough credit for just being decent people on a day-to-day basis, because interacting with other humans is really challenging, and we’ll often get it wrong. Even though we can usually be safe by avoiding sex, politics and religion, we still run the risk of offending people. I didn’t realize my “child abduction” tagline would lead to parental types with torches and pitchforks calling for my head. I stayed in a hostel on a recent overseas trip and the manager confided with me that he was there because he spent a few years in a maximum security penitentiary and was deported; a few days later I made some joke about a jail cell to him before I realized how stupid that was and walked away, savoring the taste of my foot. Human beings are notoriously difficult creatures, and that’s why there’s a premium on respecting others. It’s not easy to do. We can be R-rated individuals. I’m one. But that doesn’t give us the right to be assholes.

I can see how and why Matt Aimonetti ended up with the presentation he did, because I’ve made the same mistake. Believe me, everyone, he feels like crap. Thanks to the interwebs, it’s a thousand times worse for him than it was for me. A few months ago, I submitted a talk about social media, and the conference organizer emailed me back giving me some advice about one of the screenshots of Facebook I had taken. There was a swear word on one of the images. I honestly didn’t even notice it and I quickly removed it because it added nothing to my talk. He didn’t ask me to remove it, he just suggested that in previous years, some people complained about swearing. I wouldn’t have been one of those people, but also, I wouldn’t have been one of the people complaining about people complaining about swearing. As a community, we have matured to the point where we are starting to set boundaries. I can understand the Ruby community’s aversion to fences; Matz has always placed the importance of happiness over that of arbitrary rules. These rules, however, aren’t arbitrary. In fact, it’s through these boundaries that we open up our community to outsiders and become inclusive of people who want to create great applications, use fantastic tools, and work with awesome people. In a sense, I’m glad someone has set off this fire, because a lot of things that needed to be said have been said, and as a result, I’m absolutely positive the Ruby community will become better for it. I’m glad someone has pushed the boundaries a bit too far, and as a result, we’ve all stepped back and said, “No, we can’t.” And lastly, I know it’s selfish: I’m glad that someone wasn’t me.

– Ikai

Written by Ikai Lan

May 4, 2009 at 11:52 pm


First post, WordPress support rocks


I had an issue with the domain, and within 15 minutes, not one, but TWO folks at Automattic responded to my request and resolved the issue. Thanks guys! You guys are awesome. Hope you guys are always going to respond like this!

Ikai

Written by Ikai Lan

December 27, 2008 at 12:03 pm
