
GWT, Blobstore, the new high performance image serving API, and cute dogs on office chairs


I’ve been working on an image sharing application using GWT and App Engine to familiarize myself with the newer aspects of GWT. The project and code are here:

http://ikai-photoshare.appspot.com
http://github.com/ikai/gwt-gae-image-gallery

(Please excuse the spaghetti in the client-side GWT code; much of it was me feeling my way around GWT. I’ve come to appreciate GWT quite a bit, even though I’m already pretty familiar with client-side development; I’ll write about this in a future post.)

The 1.3.6 release of the App Engine SDK shipped with a high performance image serving API. What this means is that a developer can take a blob key pointing to image data stored in the blobstore and call getServingUrl() to create a special URL for serving the image. What are the benefits to using this API?

  • You don’t have to write your own handler for uploaded images
  • You don’t have to consume storage quota for saving resized or cropped images, as you can perform transforms on the image simply by appending URL parameters. You only need to store the final URL that is generated by getServingUrl().
  • You aren’t charged for datastore CPU for fetching the image (you will still be billed for bandwidth)
  • Images are, in general, served from edge server locations which can be geographically located closer to the user

There are a few drawbacks, however, to using the API:

  • There aren’t any great schemes for access control of the images, and if someone has the URL for a thumbnail, they can easily remove the parameters to see a larger image
  • Billing must be enabled – you will only be charged for usage, however, so you don’t have to spend a cent to use the API. You just have to have billing active.
  • Deleting an image blob doesn’t delete the image being served from the URL right away – that image will still be available for some time
  • Images must be uploaded to the blobstore, not the datastore as a blob, so it’s important to understand how the blobstore API works
  • The URLs of the created images are really, really ugly. If you need pretty URLs, it’s probably a better pattern to create a URL mapping to an HTML page that just displays the image in an IMG tag
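
As a sketch of that last pattern (the servlet name, request parameter, and =s600 size below are illustrative assumptions, not from the project):

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@SuppressWarnings("serial")
public class PhotoPageServlet extends HttpServlet {
	@Override
	protected void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws IOException {
		// In a real app we'd look the stored serving URL up by photo id;
		// it arrives as a request parameter here only to keep the sketch
		// self-contained
		String servingUrl = req.getParameter("servingUrl");
		resp.setContentType("text/html");
		resp.getWriter().println(
				"<html><body><img src=\"" + servingUrl + "=s600\"></body></html>");
	}
}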

Blobstore crash course

It’ll be best if we give a quick refresher course on the blobstore before we begin. Here’s the standard flow for a blobstore upload:

  1. Create a new blobstore session and generate an upload URL for a form to POST to. This is done using the createUploadUrl() method of BlobstoreService. Pass a callback URL to this method. This URL is where the user will be forwarded after the upload has completed.
  2. Present an upload form to the user. The action is the URL generated in step 1. Each URL must be unique: you cannot use the same URL for multiple sessions, as this will cause an error.
  3. After the user has uploaded the file, they are forwarded to the callback URL in your App Engine application specified in step 1. The key of the uploaded blob, a String blob key, is passed as a URL parameter. Save this blob key, then forward the user to their final destination.
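
A minimal sketch of step 1 looks like this (the “/upload” callback path matches what we’ll use later in this post):

BlobstoreService blobstoreService = BlobstoreServiceFactory.getBlobstoreService();

// The argument is the callback URI the blobstore POSTs to after the upload
String uploadUrl = blobstoreService.createUploadUrl("/upload");

// Render a multipart form whose action is uploadUrl:
// <form action="..." method="post" enctype="multipart/form-data">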

Got it? Now we can talk about image serving.

Using the image serving URL

Once we have a blob key (step 3 of a Blobstore upload), we can do interesting things with it. First, we’ll need to create an instance of the ImagesService:

ImagesService imagesService = ImagesServiceFactory.getImagesService();

Once we have an instance, we pass the blob key to getServingUrl and get back a URL:

String imageUrl = imagesService.getServingUrl(blobKey);

This can sometimes take several hundred milliseconds to a few seconds to generate, so it’s almost always a good idea to run this on write as opposed to first read. Subsequent calls should be faster, but they may not be as fast as reading this value from a datastore entity property or memcache. Since this value doesn’t change, it’s a good idea to store it. On the local dev server, this URL looks something like this:

/_ah/img/eq871HJL_bYxhWQbTeYYoA

In production, however, this will return a URL that looks like this:

http://lh5.ggpht.com/2PQk0vDo8Bn8oiPba2gtGlDfd1ciD0H0MLrixcT12FCDQEm2oyMW9ErJX_-ZzOHBWbYBKzevK0BY6cxdZ3cxf_37

(Cute dogs below)
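
As mentioned, this value doesn’t change, so it’s worth caching. Here’s a hedged sketch using memcache (the key-naming scheme is my own convention):

MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
String cacheKey = "servingUrl:" + blobKey.getKeyString();

String servingUrl = (String) cache.get(cacheKey);
if (servingUrl == null) {
	// Only pay the generation cost once per blob
	servingUrl = imagesService.getServingUrl(blobKey);
	cache.put(cacheKey, servingUrl);
}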

You’ve already saved yourself the trouble of writing a handler. What’s really nice about this URL is that you can perform operations on it just by appending parameters. Let’s say we wanted to resize our image so that it is no larger than 144×144, yet retains its aspect ratio. We’d simply append “=s144” to the end of the URL:

http://lh5.ggpht.com/2PQk0vDo8Bn8oiPba2gtGlDfd1ciD0H0MLrixcT12FCDQEm2oyMW9ErJX_-ZzOHBWbYBKzevK0BY6cxdZ3cxf_37=s144

(Looks like this)

We can also crop the image by appending a “-c” to the size parameter:

http://lh5.ggpht.com/2PQk0vDo8Bn8oiPba2gtGlDfd1ciD0H0MLrixcT12FCDQEm2oyMW9ErJX_-ZzOHBWbYBKzevK0BY6cxdZ3cxf_37=s144-c

(Looks like this – compare with above)

Note that we can also generate these URLs programmatically using the overloaded version of getServingUrl that also accepts a size and crop parameter.
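
For example, assuming the overload that takes the blob key, a size, and a crop flag:

// Should be equivalent to appending "=s144-c" to the URL by hand
String croppedThumbnailUrl = imagesService.getServingUrl(blobKey, 144, true);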

Adding GWT

So now that we’ve got all that done, let’s get it working with GWT. It’s important that we understand how it all works, because we need to take GWT’s single-page, JavaScript-generated content model into account. Let’s build our upload widget. We’ll be using UiBinder:

We’ll create our Composite class as follows:

public class UploadPhoto extends Composite {

    private static UploadPhotoUiBinder uiBinder = GWT.create(UploadPhotoUiBinder.class);

    UserImageServiceAsync userImageService = GWT.create(UserImageService.class);

    interface UploadPhotoUiBinder extends UiBinder<Widget, UploadPhoto> {}

    @UiField
    Button uploadButton;

    @UiField
    FormPanel uploadForm;

    @UiField
    FileUpload uploadField;

    public UploadPhoto() {
        initWidget(uiBinder.createAndBindUi(this));
    }

}

Here’s the corresponding XML file:

<!DOCTYPE ui:UiBinder SYSTEM "http://dl.google.com/gwt/DTD/xhtml.ent">
<ui:UiBinder xmlns:ui="urn:ui:com.google.gwt.uibinder"
	xmlns:g="urn:import:com.google.gwt.user.client.ui">
	<g:FormPanel ui:field="uploadForm">
		<g:HorizontalPanel>
			<g:FileUpload ui:field="uploadField"></g:FileUpload>
			<g:Button ui:field="uploadButton"></g:Button>
		</g:HorizontalPanel>
	</g:FormPanel>
</ui:UiBinder> 

(We’ll add more to this later)

When we discussed the Blobstore, we mentioned that each upload form has a different POST location corresponding to the upload session. We’ll have to add a GWT-RPC component to generate and return a URL. Let’s do that now:

// UserImageService.java
@RemoteServiceRelativePath("images")
public interface UserImageService extends RemoteService  {
    public String getBlobstoreUploadUrl();
}

Our IDE will nag us to generate the corresponding Async interface if we have a GWT plugin:

// UserImageServiceAsync.java
public interface UserImageServiceAsync {
    public void getBlobstoreUploadUrl(AsyncCallback<String> callback);
}

We’ll need to write the code on the server side:

// UserImageServiceImpl.java
@SuppressWarnings("serial")
public class UserImageServiceImpl extends RemoteServiceServlet implements UserImageService {

    @Override
    public String getBlobstoreUploadUrl() {
        BlobstoreService blobstoreService = BlobstoreServiceFactory.getBlobstoreService();
        return blobstoreService.createUploadUrl("/upload");
    }

}

This is pretty straightforward. We’ll want to invoke this service on the client side when we build the form. Let’s add this to UploadPhoto:

public class UploadPhoto extends Composite {

    private static UploadPhotoUiBinder uiBinder = GWT.create(UploadPhotoUiBinder.class);

    UserImageServiceAsync userImageService = GWT.create(UserImageService.class);

    interface UploadPhotoUiBinder extends UiBinder<Widget, UploadPhoto> {}

    @UiField
    Button uploadButton;

    @UiField
    FormPanel uploadForm;

    @UiField
    FileUpload uploadField;

    public UploadPhoto() {
        initWidget(uiBinder.createAndBindUi(this));

        // Disable the button until we get the URL to POST to
        uploadButton.setText("Loading...");
        uploadForm.setEncoding(FormPanel.ENCODING_MULTIPART);
        uploadForm.setMethod(FormPanel.METHOD_POST);
        uploadButton.setEnabled(false);
        uploadField.setName("image");

        // Now we use our GWT-RPC service and get a URL
        startNewBlobstoreSession();

        // Once we've hit submit and it's complete, let's set the form to a new session.
        // We could also have probably done this on the onClick handler
        uploadForm.addSubmitCompleteHandler(new FormPanel.SubmitCompleteHandler() {

            @Override
            public void onSubmitComplete(SubmitCompleteEvent event) {
                uploadForm.reset();
                startNewBlobstoreSession();
            }
        });
    }

    private void startNewBlobstoreSession() {
        userImageService.getBlobstoreUploadUrl(new AsyncCallback<String>() {

            @Override
            public void onSuccess(String result) {
                uploadForm.setAction(result);
                uploadButton.setText("Upload");
                uploadButton.setEnabled(true);
            }

            @Override
            public void onFailure(Throwable caught) {
                // We probably want to do something here
            }
        });
    }

    @UiHandler("uploadButton")
    void onSubmit(ClickEvent e) {
        uploadForm.submit();
    }

}

This is fairly standard GWT RPC.

So that concludes the GWT part of it. We mentioned an upload callback. Let’s implement that now:

/**
 * @author Ikai Lan
 * 
 *         This is the servlet that handles the callback after the blobstore
 *         upload has completed. After the upload finishes, the blobstore
 *         POSTs to the callback URL, which must return a redirect. We
 *         redirect to the GET portion of this servlet, which writes the
 *         image serving URL back as the response body. This adds an extra
 *         request, but it gives GWT a value to work with for displaying the
 *         uploaded image. Note the content-type: we *need* to set it to
 *         "text/html" for the GWT client to receive the results. On the GWT
 *         side, we'll take this URL and show the image that was uploaded.
 * 
 */
@SuppressWarnings("serial")
public class UploadServlet extends HttpServlet {
	private static final Logger log = Logger.getLogger(UploadServlet.class
			.getName());

	private BlobstoreService blobstoreService = BlobstoreServiceFactory
			.getBlobstoreService();

	public void doPost(HttpServletRequest req, HttpServletResponse res)
			throws ServletException, IOException {

		Map<String, BlobKey> blobs = blobstoreService.getUploadedBlobs(req);
		BlobKey blobKey = blobs.get("image");

		if (blobKey == null) {
			// Uh ... something went really wrong here. Send the user home.
			res.sendRedirect("/");
		} else {

			ImagesService imagesService = ImagesServiceFactory
					.getImagesService();

			// Get the image serving URL
			String imageUrl = imagesService.getServingUrl(blobKey);

			// For the sake of clarity, we'll use low-level entities
			Entity uploadedImage = new Entity("UploadedImage");
			uploadedImage.setProperty("blobKey", blobKey);
			uploadedImage.setProperty("createdAt", new Date());

			// Highly unlikely we'll ever filter on this property
			uploadedImage.setUnindexedProperty("servingUrl", imageUrl);

			DatastoreService datastore = DatastoreServiceFactory
					.getDatastoreService();
			datastore.put(uploadedImage);

			res.sendRedirect("/upload?imageUrl=" + imageUrl);
		}
	}

	@Override
	protected void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws ServletException, IOException {

		String imageUrl = req.getParameter("imageUrl");
		resp.setContentType("text/html");

		// This is a bit hacky, but it'll work: the GWT client reads this
		// URL out of the SubmitCompleteEvent and uses it to display the
		// uploaded image
		resp.getWriter().println(imageUrl);

	}
}

We’ll probably want to display the image we just uploaded in the client. Let’s extend the SubmitCompleteHandler we registered earlier to do this:

	public void onSubmitComplete(SubmitCompleteEvent event) {
		uploadForm.reset();
		startNewBlobstoreSession();

		// This is what gets the result back - the content-type *must* be
		// text/html
		String imageUrl = event.getResults();
		Image image = new Image();
		image.setUrl(imageUrl);

		final PopupPanel imagePopup = new PopupPanel(true);
		imagePopup.setWidget(image);

		// Add some effects
		imagePopup.setAnimationEnabled(true); // animate opening the image
		imagePopup.setGlassEnabled(true); // darken everything under the image
		imagePopup.setAutoHideEnabled(true); // close image when the user clicks
												// outside it
		imagePopup.center(); // center the image

	}

And we’re done!

Get the code

I’ve got the code for this project here:

http://github.com/ikai/gwt-gae-image-gallery

Just a warning: this is a bit different from the sample code above. I wrote this post after I wrote the code, extracting the bare minimum needed to make this work. The project code also has experimental tagging, deletion, and login handling. I’m adding features to it simply to see what else can be done, so expect changes. I’m aware of a few bugs in the code, and I’ll get around to fixing them, but again, it’s a demo project, so keep your expectations realistic. As far as I can tell, however, the code above should be runnable locally and deployable (once you have enabled billing for the blobstore).

Happy developing!


Written by Ikai Lan

September 8, 2010 at 5:00 pm

Posted in App Engine, Java

Using the App Engine Mapper for bulk data import


Since my last post describing App Engine mapreduce, a new InputReader has been added to the Java project for reading from the Blobstore. Nick Johnson wrote a great demo where indexing was done via reading code uploaded to the blobstore. This was demo’d at Google I/O. Now that the library is officially part of the project, it’s become much easier for developers to build Mappers that map across some large, contiguous piece of data as opposed to Entities in the datastore.

The most obvious use case is data import. A developer looking to import large amounts of data would take the following steps:

  1. Create a CSV file containing the data you want to import. The assumption here is that each line of data corresponds to a datastore entity you want to create
  2. Upload the CSV file to the blobstore. You’ll need billing to be enabled for this to work.
  3. Create your Mapper, push it live and run your job importing your data.

This isn’t meant to be a replacement for the bulk uploader tool, merely an alternative. This method requires considerably more programming for custom data transforms. The advantage of this method is that the work is done on the server side, whereas the bulk uploader makes use of the remote API to get work done. Let’s get started on each of the steps.

Step 1: Create a CSV file with the data you want to upload

We’re going to go through an example of uploading City and State information. MaxMind.com provides a free GeoIP CSV file. The free version isn’t as full featured as the paid version, but it’ll do fine for our demo. If you use this file in any kind of production application, be sure to read and understand the license first! For simplicity, we’re going to parse out only cities in the United States using grep. The file should now contain lines that look like this:

605,"US","NY","Valhalla","10595",41.0877,-73.7768,501,914
606,"US","PA","Pittsburgh","15222",40.4495,-79.9880,508,412
607,"US","MO","Bridgeton","63044",38.7667,-90.4201,609,314
608,"US","CA","San Francisco","94124",37.7312,-122.3826,807,415
609,"US","NY","New York","10017",40.7528,-73.9725,501,212
610,"US","PA","Bear Lake","16402",41.9491,-79.4448,516,814
611,"US","NJ","Piscataway","08854",40.5516,-74.4637,501,732
612,"US","NY","Keuka Park","14478",42.5669,-77.1325,555,315
613,"US","VT","Brattleboro","05302",42.8496,-72.6645,506,802

Step 2: Create an upload handler for your CSV file and upload the CSV file

We’re going to create a basic handler for uploading a CSV file and displaying the key. We’ll need to pass this key to our mapper later. There isn’t too much magic here; it’s very similar to the sample code available for the basic blobstore example.

We’ll do a quick overview of the code we need here, but for the purposes of this post, it’s out of scope. We’ll need these files:

upload.jsp

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

<%@page import="com.google.appengine.api.blobstore.BlobstoreService"%>
<%@page import="com.google.appengine.api.blobstore.BlobstoreServiceFactory"%>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Upload your CSV file here</title>
</head>
<body>
    <% BlobstoreService blobstoreService = BlobstoreServiceFactory.getBlobstoreService(); %>
    <form action="<%= blobstoreService.createUploadUrl("/upload") %>" method="post" enctype="multipart/form-data">
        <input type="file" name="data">
        <input type="submit" value="Submit">
    </form>
</body>
</html>

UploadBlobServlet.java

package com.ikai.mapperdemo.servlets;

import java.io.IOException;
import java.util.Map;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.google.appengine.api.blobstore.BlobKey;
import com.google.appengine.api.blobstore.BlobstoreService;
import com.google.appengine.api.blobstore.BlobstoreServiceFactory;

@SuppressWarnings("serial")
public class UploadBlobServlet extends HttpServlet {
	public void doPost(HttpServletRequest req, HttpServletResponse resp)
			throws IOException {

		BlobstoreService blobstoreService = BlobstoreServiceFactory.getBlobstoreService();
		Map<String, BlobKey> blobs = blobstoreService.getUploadedBlobs(req);
		BlobKey blobKey = blobs.get("data");

		if (blobKey == null) {
			resp.sendRedirect("/");
		} else {
			resp.sendRedirect("/upload-success?blob-key=" + blobKey.getKeyString());
		}
	}

}

SuccessfulUploadServlet.java

package com.ikai.mapperdemo.servlets;

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@SuppressWarnings("serial")
public class SuccessfulUploadServlet extends HttpServlet {
	public void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws IOException {

		String blobKey = req.getParameter("blob-key");

		resp.setContentType("text/html");
		resp.getWriter().println("Successfully uploaded. Download file: <br/>");
		resp.getWriter().println(
				"<a href='/serve?blob-key=" + blobKey
						+ "'>Click to download</a>");
	}

}

Source code for this and other helper functions should be available in the Github repository.

Step 3: Create your Mapper

Now we get to the fun part. We need to create our Mapper. A prerequisite for understanding what’s coming next is reading the last post about Mapper I wrote, so check that out before proceeding if you aren’t familiar with Mapper basics. Our Mapper class looks like this:

ImportFromBlobstoreMapper.java

package com.ikai.mapperdemo.mappers;

import java.util.logging.Logger;

import org.apache.hadoop.io.NullWritable;

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
import com.google.appengine.tools.mapreduce.BlobstoreRecordKey;
import com.google.appengine.tools.mapreduce.DatastoreMutationPool;

/**
 * 
 * This Mapper imports from a CSV file in the Blobstore. The CSV
 * is assumed to be in the MaxMind format for cities, states, zipcodes
 * and lat/long.
 * 
 * 
 * @author Ikai Lan
 *
 */
public class ImportFromBlobstoreMapper extends
		AppEngineMapper<BlobstoreRecordKey, byte[], NullWritable, NullWritable> {
	private static final Logger log = Logger.getLogger(ImportFromBlobstoreMapper.class
			.getName());

	@Override
	public void map(BlobstoreRecordKey key, byte[] segment, Context context) {
		
		String line = new String(segment);
		
		log.info("At offset: " + key.getOffset());
		log.info("Got value: " + line);
		
		// Line format looks like this:
		// 10644,"US","VA","Tazewell","24651",37.0595,-81.5220,559,276
		// We're also assuming no errant commas in this simple example
		
		String[] values = line.split(",");
		String state = values[2];
		String cityName = values[3];		
		String zipcode = values[4];
		Double latitude = Double.parseDouble(values[5]);
		Double longitude = Double.parseDouble(values[6]);		
		
		state = state.replaceAll("\"", "");
		cityName = cityName.replaceAll("\"", "");
		zipcode = zipcode.replaceAll("\"", "");
		
		if(!zipcode.isEmpty()) {
			Entity zip = new Entity("Zip", zipcode);
			zip.setProperty("state", state);
			zip.setProperty("city", cityName);
			zip.setProperty("latitude", latitude);
			zip.setProperty("longitute", longitude);
			
			Entity city = new Entity("City", cityName);
			city.setProperty("state", state);
			city.setUnindexedProperty("zip", zipcode);
			
			DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
					.getMutationPool();
			mutationPool.put(zip);
			mutationPool.put(city);
		}

	}
}

Let’s explain the things in this Mapper that are new:

public class ImportFromBlobstoreMapper extends
AppEngineMapper<BlobstoreRecordKey, byte[], NullWritable, NullWritable>

Note this line. It’s different from our previous Mappers in that the type arguments are no longer Key and Entity, but BlobstoreRecordKey and byte[]. The source for BlobstoreRecordKey is here. Remember that map-reduce is about taking some large body of data and breaking it into smaller pieces to operate on. BlobstoreRecordKey represents a pointer to a range of data in our Blobstore; the byte[] is the byte array actually containing that data.

public void map(BlobstoreRecordKey key, byte[] segment, Context context)

Again, notice the new types. By default, we are splitting on a newline, so segment represents a single line. We can change what we split on by specifying a terminator in mapreduce.xml.

		String line = new String(segment);
		
		// Line format looks like this:
		// 10644,"US","VA","Tazewell","24651",37.0595,-81.5220,559,276
		// We're also assuming no errant commas in this simple example
		
		String[] values = line.split(",");
		String state = values[2];
		String cityName = values[3];		
		String zipcode = values[4];
		Double latitude = Double.parseDouble(values[5]);
		Double longitude = Double.parseDouble(values[6]);		
		
		state = state.replaceAll("\"", "");
		cityName = cityName.replaceAll("\"", "");
		zipcode = zipcode.replaceAll("\"", "");

This is very naive String parsing. Nothing fancy here.

		if(!zipcode.isEmpty()) {
			Entity zip = new Entity("Zip", zipcode);
			zip.setProperty("state", state);
			zip.setProperty("city", cityName);
			zip.setProperty("latitude", latitude);
			zip.setProperty("longitute", longitude);
			
			Entity city = new Entity("City", cityName);
			city.setProperty("state", state);
			city.setUnindexedProperty("zip", zipcode);
			
			DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
					.getMutationPool();
			mutationPool.put(zip);
			mutationPool.put(city);
		}

Again, very straightforward if you’ve seen this before. Some zipcodes in our CSV file subset are empty, so we’ll check for that and just not create an Entity. We’re adding 2 entities to the mutation pool here – a City and a Zipcode. This ensures that we can search by key when we do a datastore get. Remember that fetches by key are always faster than fetches with a query, since a query requires an index scan followed by a batch get, whereas the datastore can perform a get in a single operation.
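
For instance, once the import has run, a city created by the Mapper above can be fetched directly by key (a hedged sketch; the kind and key name match the code above):

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Key cityKey = KeyFactory.createKey("City", "San Francisco");
try {
	// A single get by key - no index scan involved
	Entity city = datastore.get(cityKey);
} catch (EntityNotFoundException e) {
	// That city wasn't in our CSV subset
}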

That’s it for our Mapper. Let’s add a configuration:

  <configuration name="Import all data from the Blobstore">
    <property>
      <name>mapreduce.map.class</name>
      
      <!--  Set this to be your Mapper class  -->
      <value>com.ikai.mapperdemo.mappers.ImportFromBlobstoreMapper</value>
    </property>
        
    <!--  This is a default tool that lets us iterate over blobstore data -->
    <property>
      <name>mapreduce.inputformat.class</name>
      <value>com.google.appengine.tools.mapreduce.BlobstoreInputFormat</value>
    </property>
    
    <property>
      <name human="Blob Keys to Map Over">mapreduce.mapper.inputformat.blobstoreinputformat.blobkeys</name>
      <value template="optional">blobkeyhere</value>      
    </property>        
    
    <property>
      <name human="Number of shards to use">mapreduce.mapper.shardcount</name>
      <value template="optional">10</value>      
    </property>        
    
  </configuration>  

We’ve changed 2 properties here: the input format class as well as a property for the blobstore key pointing to the data to iterate over.

Step 4: Deploy!

We can now package our application up and deploy it! Make sure that you build a new JAR file with the new classes from appengine-mapreduce! If you have an old JAR file, it won’t include the BlobstoreInputFormat class that we need to do our work.

Step 5: Using the Mapper

Let’s browse to our upload handler at /upload.jsp. The page should be pretty bare.

Once the upload has finished, we’ll be on a page that looks like this:

Let’s copy the blob-key in the URL. It’s not the most streamlined approach, but it works. We’ll use it in the next screen when we browse to our mapper:

We’ll copy-paste the key to replace “blobkeyhere” and hit “Run”. And now we play the waiting game – we’ll be able to check on the status of our Mapper in the UI, or check on Tasks, or look in the datastore to see if the data has been imported correctly:

Get the code

The code is here on Github:

http://github.com/ikai/App-Engine-Java-Mapper-API-demos

It’s been updated with the new examples.

Summary

So there you have it: another way of importing data into the datastore. This isn’t a replacement for the bulk uploader, just another option. Here are some useful links for additional information:

App Engine Mapreduce issues tracker – report issues here

Nick Johnson’s post explaining how he built the code search example

One last tip: the best place for questions or discussion is probably the App Engine Discussion Groups, not the comments.

Happy hacking.

Written by Ikai Lan

August 11, 2010 at 10:33 am

Posted in App Engine, Java

Issuing App Engine datastore queries with the Low-Level API


Last time, I wrote an introduction to using the low-level API for creating entities, setting keys, and getting keys by value.

Basic queries and sorts

Fetches by key are useful when we know the keys, but it’s often very useful to be able to query entities by their properties. Consider the Person entities we created for the last example, Alice and Bob:

Entity alice = new Entity("Person", "Alice");
alice.setProperty("gender", "female");
alice.setProperty("age", 20);

Entity bob = new Entity("Person", "Bob");
bob.setProperty("gender", "male");
bob.setProperty("age", 23);

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
datastore.put(alice);
datastore.put(bob);

Let’s create a query to find the first 10 Persons that are female and sort them by age ascending. How would we write this?

Query findFemalesQuery = new Query("Person");
findFemalesQuery.addFilter("gender", FilterOperator.EQUAL, "female");
findFemalesQuery.addSort("age", SortDirection.ASCENDING);
datastore.prepare(findFemalesQuery).asList(FetchOptions.Builder.withLimit(10));

Here are the steps we took:

  1. Created a Query object, specifying the Query kind
  2. Added a QueryFilter. Note that this is typesafe. We specify the enum representing the FilterOperator we want to use
  3. Added a QuerySort. Again, like the QueryFilter, we select the property to sort on as well as an enum representing either an ascending order or descending order.
  4. We prepare the query. We can then return the result as either an Iterator or a List of Entities, and we can either execute the default query or pass in a set of options. In the example above, we use FetchOptions.Builder to set the only option we care about: the limit. We only want 10, so we call withLimit() and pass it 10.

The query interface works well because it’s typesafe where the datastore is typesafe, and not so where the datastore is not – you won’t get errors at runtime because you misspelled “WHERE”, for instance, but you do have to be careful not to misspell the properties you are looking for. The flexibility of this interface means that we are no longer constrained by the “every object must have the same bag of properties” frame of thinking. Furthermore, because we don’t need to know the property names a priori (we can use getProperties() and get back a Map), we can iterate through the result and figure out the key/value pairs at runtime. This leads to some very powerful abstractions.
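
For example, a quick sketch of discovering an entity’s properties at runtime:

// getProperties() returns a Map<String, Object> we can walk without
// knowing the property names ahead of time
for (Map.Entry<String, Object> property : alice.getProperties().entrySet()) {
	System.out.println(property.getKey() + " = " + property.getValue());
}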

Doing a keys only query

It sometimes makes sense for us to only retrieve the keys in a given query. It’s actually incredibly easy, as long as we know what to expect:

Query findFemalesQuery = new Query("Person");
findFemalesQuery.addFilter("gender", FilterOperator.EQUAL, "female");
findFemalesQuery.addSort("age", SortDirection.ASCENDING);
findFemalesQuery.setKeysOnly();

List<Entity> results = datastore.prepare(findFemalesQuery).asList(
		FetchOptions.Builder.withLimit(10));

The only code that’s different in creating the Query object is that we call setKeysOnly(). This still returns a List of entity objects with only the Kind and Key populated. If we wrote a test for this, it would look like this:

Entity alice = results.get(0);
assertEquals("Return Key for Entity", KeyFactory.createKey("Person", "Alice"), alice.getKey());
assertNull("Should not return female property", alice.getProperty("gender"));
assertEquals("Returns Entities with no properties", 0, alice.getProperties().size());

Only the Kind and Key are populated in these Entity objects. Even though the API looks similar, the behavior under the hood is completely different. Recall how queries normally work:

  1. Traverse an index and retrieve keys
  2. Using those keys, fetch the entities from the datastore

The time to do a query depends on the index traversal time as well as the number of entities to retrieve. In a keys only query, this is what happens:

  1. Traverse an index and retrieve keys

We completely eliminate step 2 from the process. If all we want is Key information or are counting entities (and the count can be done using only indexes), this is the approach we would take.
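
As a hedged sketch, counting with a keys-only query might look like this (countEntities() here is the no-argument form from this era of the SDK):

Query countFemalesQuery = new Query("Person");
countFemalesQuery.addFilter("gender", FilterOperator.EQUAL, "female");
countFemalesQuery.setKeysOnly();

// Only the index is traversed; no entities are fetched
int numberOfFemales = datastore.prepare(countFemalesQuery).countEntities();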

Ancestor Queries

Let’s pretend Alice and Bob have child entities:

Entity madHatter = new Entity("Friend", "Mad Hatter", alice.getKey());
Entity dormouse = new Entity("Friend", "Dormouse", alice.getKey());
Entity cheshireCat = new Entity("Friend", "Cheshire Cat", alice.getKey());

Entity redQueen = new Entity("Friend", "Red Queen", bob.getKey());

datastore.put(madHatter);
datastore.put(dormouse);
datastore.put(cheshireCat);
datastore.put(redQueen);

Alice now has Friends Mad Hatter, Dormouse and the Cheshire Cat as child entities, while Bob has only the Red Queen. How do we find all friends of Alice or Bob? Like so:

Query friendsOfAliceQuery = new Query("Friend");
friendsOfAliceQuery.setAncestor(alice.getKey());

List<Entity> results = datastore.prepare(friendsOfAliceQuery).asList(FetchOptions.Builder.withDefaults());

Query friendsOfBobQuery = new Query("Friend");
friendsOfBobQuery.setAncestor(bob.getKey());

results = datastore.prepare(friendsOfBobQuery).asList(FetchOptions.Builder.withDefaults());

What’s great about these queries is that the datastore knows exactly where to start. Because keys embed parent Key information – Mad Hatter, Dormouse and the Cheshire Cat all have Alice’s key as a prefix of their keys (this is also why you cannot change an entity’s entity group after creation) – we know we can start the query at Alice’s Key and traverse only the entities that fall under it. It’s also a great way of organizing data. Just be aware that too many transactions on a single entity group will destroy your throughput, so design your entity groups to be as small as possible.
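
A quick sketch of that prefixing using KeyFactory:

// A child key embeds its parent's key - this is why an entity's
// entity group cannot change after creation
Key aliceKey = KeyFactory.createKey("Person", "Alice");
Key madHatterKey = KeyFactory.createKey(aliceKey, "Friend", "Mad Hatter");
// madHatterKey's path is Person("Alice") -> Friend("Mad Hatter")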

Summary

Hopefully this blog post explains a few more features of the low-level API. Understanding the low-level API is an important step in understanding the datastore, and understanding the datastore is a critical step for learning how to build efficient, optimized applications for App Engine.

Written by Ikai Lan

July 13, 2010 at 4:43 pm

Google App Engine Tips and Tricks: Prebuilding Indexes using a non-default version


(This’ll be a shorter post than usual.)

Waiting for indexes to build can be a drag: indexes need to be fully built before queries can use them, and building can take longer than expected if the global index-building workflow is backed up, since mass index building is a shared resource.

One little known trick is to pre-build indexes before your application needs them by deploying a non-default version. Your application can have many versions. In Java App Engine, this is defined in the version tag of appengine-web.xml. In Python, this is defined in the version YAML element. The Java Eclipse plugin even has a screen where the version can be set (click the App Engine icon, then “Project Settings”).

Because all versions of an application share the same datastore, the required indexes will be built once you push a version containing the index configuration file with the new, required indexes. Hopefully, by the time you are ready to push your real version, the indexes will have completed building.

In general, it is a best practice to maintain a staging version of your application for testing against live data. App Engine makes this so easy it’s trivial: deploy code tagged with a new “version”. Your application is accessible at http://VERSION.latest.APPID.appspot.com (note that VERSION is a String, not an integer or decimal number) – this is a handy and powerful trick for validating a new test or staging version. When you have enough confidence in your application, browse to the Admin Console, click the radio button associated with your new version, and click “Make Default.”

Versioning has never been so easy. No configuring load balancers, rolling deploys, symlinking, restarting edge caches, etc.

Happy hacking.

Written by Ikai Lan

July 12, 2010 at 12:34 pm

Using the Java Mapper Framework for App Engine


The recently released Mapper framework is the first part of App Engine’s mapreduce offering. In this post, we’ll be discussing some of the types of operations we can perform using this framework and how easily they can be done.

Introduction to Map Reduce

If you aren’t familiar with Map Reduce, read about it from a high level on Wikipedia here. The official paper can be downloaded from this site if you’re interested in a more technical discussion.

The simplest breakdown of MapReduce is as follows:

  1. Take a large dataset and break it into pieces, mapping individual pieces of data
  2. Work on those mapped datasets and reduce them into the form you need

A simple example here is full text indexing. Suppose we wanted to create indexes from existing text documents. We would use the Map step to iterate over every document and “map” each phrase or term to a document, then we would “reduce” the mappings by writing them to an index. Map/reduce problems have the advantage of not only being easy to conceptualize as problems that can be distributed and parallelized, but also of being backed by frameworks that support many of the administrative functions of map-reduce: failure recovery, distribution of work, tracking status of jobs, reporting, and so forth. The appengine-mapreduce project seeks to provide as many of these features as possible while making it as easy as possible for developers to write large batch processing jobs without having to think about the plumbing details.
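
To make that concrete, here’s a purely conceptual sketch of the indexing example (Emitter and Index are hypothetical interfaces for illustration; they are not part of any App Engine or Hadoop API):

import java.util.List;

// Hypothetical collaborators, just for this sketch
interface Emitter { void emit(String term, String docId); }
interface Index { void write(String term, List<String> docIds); }

class FullTextIndexer {
	// Map: emit a (term -> document id) pair for every term in a document
	void map(String docId, String text, Emitter emitter) {
		for (String term : text.split("\\s+")) {
			emitter.emit(term, docId);
		}
	}

	// Reduce: collect every document id seen for a term into one index entry
	void reduce(String term, List<String> docIds, Index index) {
		index.write(term, docIds);
	}
}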

But I only have Map available!

Yes, this is true – as of the writing of this post, only the “map” step exists, hence it’s currently referred to as the “Mapper API”. That doesn’t mean it’s not useful. For starters, it is a very easy way to perform some operation on every single Entity of a given Kind in your datastore in parallel. What would you have to build for yourself if Mapper weren’t available?

  1. Begin querying over every Entity in chained Task Queues
  2. Store beginning and end cursors (introduced in 1.3.5)
  3. Create tasks to work with chunks of your datastore
  4. Write the code to manipulate your data
  5. Build an interface to control your batch jobs
  6. Build a callback system for your multitudes of parallelized workers to call when the entire task has completed

It’s certainly not a trivial amount of work. Some things you can do very easily with the Mapper library include:

  • Modify some property or set of properties for every Entity of a given Kind
  • Delete all entities of a single Kind – the functional equivalent of a “DROP TABLE” if you were using a relational database
  • Count the occurrences of some property across every single Entity of a given Kind in your datastore

We’ll go through a few of these examples in this post.

Our sample application

Our sample application will be a modified version of the Guestbook demo. We’ll add a few additional properties. For simplicity, we’ll use the low-level API, since the Mapper API also uses the low-level API. You can see this application live, and the code is also available to clone via Github if you’d like to follow along.

How to define a Mapper

There are a few steps to defining a Mapper:

  1. Download, build and place the appengine-mapreduce JAR files in your WEB-INF/lib directory and add them to your build path. You only need to do this once per project. The steps for doing this are on the “Getting Started” page for Java. You’ll need all the JAR files that are built.
  2. Make sure that we have a DESCENDING index created on Key. This is important! If we run our Mapper locally, this’ll automatically be created in our datastore-indexes.xml file when we deploy our application. One trick to ensure that indexes get built before they are needed, at least in a live application, is to create and deploy an application with the new index configuration to a non-default version. Because all versions use the same datastore and the same set of indexes, this will schedule the index to be built before we need it in the live version. When it has completed, we simply switch the default version over, and we’re ready to roll.
  3. Create your Mapper class
  4. Configure your Mapper class in mapreduce.xml

We’ll go over steps 3 and 4 in each example.

Example 1: Changing a property on every Entity (Naive way)

(You can even use this technique if you just need to change a property on a large set of Entities).

Assuming you’ve already set up your environment for the Mapper servlet, you can dive right in. Let’s create a Mapper class that goes through every Entity of a given Kind and converts the “comment” property to use all lowercase letters. We’ll also add a timestamp recording when we modified the Entity. In this first example, we’ll do this the naive way. This is a very good way to introduce you to very simple mutations on all your Entities using Mapper.

Note that this requires some familiarity with the low-level API. Don’t worry – entities edited or saved using the low-level API are accessible via managed persistence interfaces such as JDO/JPA (and vice versa). If you aren’t familiar with the low-level API, you can read more about it here on the Javadocs.

The first thing we’ll have to do is define a Mapper. We tried as much as possible to mimic Hadoop’s Mapper class. We’ll be subclassing AppEngineMapper, which is itself a subclass of Hadoop’s Mapper. The meat of this class is the map() method, which we’ll be overriding. We’ll also override the taskSetup() lifecycle callback. We’ll be using this to initialize our DatastoreService, though we could probably initialize it in the body of the map() method itself. The other methods are taskCleanup(), setup() and cleanup() – examples here. Let’s have a look at our code below:

package com.ikai.mapperdemo.mappers;

import java.util.Date;
import java.util.logging.Logger;

import org.apache.hadoop.io.NullWritable;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;

/**
 *
 * This mapper changes all Strings to lowercase Strings, sets
 * a timestamp, and reputs them into the Datastore. The reason
 * this is a "Naive" Mapper is because it doesn't make use of
 * Mutation Pools, which can do these operations in batch instead
 * of individually.
 *
 * @author Ikai Lan
 *
 */
public class NaiveToLowercaseMapper extends
		AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
	private static final Logger log = Logger
			.getLogger(NaiveToLowercaseMapper.class.getName());

	private DatastoreService datastore;

	@Override
	public void taskSetup(Context context) {
		this.datastore = DatastoreServiceFactory.getDatastoreService();
	}

	@Override
	public void map(Key key, Entity value, Context context) {
		log.info("Mapping key: " + key);

		if (value.hasProperty("comment")) {
			String comment = (String) value.getProperty("comment");
			comment = comment.toLowerCase();
			value.setProperty("comment", comment);
			value.setProperty("updatedAt", new Date());

			datastore.put(value);

		}
	}
}

Notice that this map method takes 3 parameters:

Key key – this is the datastore Key for the Entity we are about to operate on. It mostly exists for API compatibility with Hadoop. For iterating over datastore Entities we don’t really need it: we *could* use it to look up the Entity, but we don’t have to, because …

Entity value – … we actually get the Entity already. If we did a lookup for the Entity, we’d double the number of lookups we do per Entity. We could certainly use the Key to do a lookup using a PersistenceManager or EntityManager and get back a populated, typesafe object, but from an efficiency standpoint we’d be doubling our work for some JDO/JPA sugar.

Context context – We don’t need this in our example, but it’s easy to think of the Context as giving us access to “global” values such as temporary variables and configuration files. For a later example in this post, we’ll be using the Context to store a global value in a counter and increment it. For this example, it’s unused.

If you’re familiar at all with the low-level API, this will look very straightforward (again, I highly encourage you to read the docs). We take an entity, add 2 properties to it, then re-put() the Entity back into the datastore.

Now let’s add this job to mapreduce.xml:

<configurations>
  <configuration name="Naive Mass toLowercase()">
    <property>
      <name>mapreduce.map.class</name>

      <!--  Set this to be your Mapper class  -->
      <value>com.ikai.mapperdemo.mappers.NaiveToLowercaseMapper</value>
    </property>

    <!--  This is a default tool that lets us iterate over datastore entities -->
    <property>
      <name>mapreduce.inputformat.class</name>
      <value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
    </property>

    <property>
      <name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
      <value template="optional">Comment</value>
    </property>
  </configuration>
</configurations>

It looks complex, but it’s really not. We define a configuration element and name the job. The name of the job is also the name we’ll see in the GUI when we fire off the job. We need 3 sets of property elements under this element, which are just name/value pairs. Let’s go over each one we used:

Name: mapreduce.map.class
Value: com.ikai.mapperdemo.mappers.NaiveToLowercaseMapper
This one is straightforward – we provide the name of an AppEngineMapper subclass with the map() method we want run.

Name: mapreduce.inputformat.class
Value: com.google.appengine.tools.mapreduce.DatastoreInputFormat
This is a class that takes some input to map over. DatastoreInputFormat is provided by appengine-mapreduce, but it is possible for us to define our own input formatter. For guidance, check out the source of DatastoreInputFormat here.

In a more advanced example (ahem, future blog post), we’ll discuss building our own InputFormat to read from another source such as the Blobstore. For our examples in this post, we won’t need anything beyond DatastoreInputFormat.

Name: mapreduce.mapper.inputformat.datastoreinputformat.entitykind
Value: Comment
This input is specific to DatastoreInputFormat. It tells DatastoreInputFormat which Entity Kind to iterate over. Note that in the mapper console, a user can type in the name of a Kind or edit this Field to reflect the value they want. We can’t leave this blank, though, if we want this to work.

If we browse to the URI at which we’ve defined the Mapper console (in our case /mapper), we see something that looks like this:

“Running jobs” appears when we click “Run”. We can click “Detail” to see the progress of our job, or we can “Abort” to quit the job. Note that aborting a job won’t revert our Entities! We’ll end up with a partially run job if we run a giant mutation, so we’ll have to be cognizant of this when we use this tool.

When the job completes, we’ll take a look at our Comments. Sure enough, they are now all lowercase.

Example 2: Changing a property on every Entity using Mutation Pools

There’s a reason the Mapper in Example 1 is called a Naive Mapper: because it doesn’t take advantage of mutation pools. As we all know, App Engine’s datastore is capable of handling operations in parallel using batched calls. We’re already doing work in parallel by specifying shards, but we’ll want to use batched calls when possible. We do this by adding the mutations we want to a mutation pool, then, periodically as the pool hits a certain size, we flush all the writes to the datastore with a single call instead of individually. This has the advantage of making our map() call as fast as possible, since all we’re really doing is making a list of operations to perform all at once when the system is good and ready. Let’s define the XML file first assuming we call the class PooledToLowercaseMapper:

  <configuration name="Mass toLowercase() with Mutation Pool">
    <property>
      <name>mapreduce.map.class</name>

      <!--  Set this to be your Mapper class  -->
      <value>com.ikai.mapperdemo.mappers.PooledToLowercaseMapper</value>
    </property>

    <!--  This is a default tool that lets us iterate over datastore entities -->
    <property>
      <name>mapreduce.inputformat.class</name>
      <value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
    </property>

    <property>
      <name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
      <value template="optional">Comment</value>
    </property>

  </configuration>

It looks almost exactly the same. That’s because the meat is in what we do in the actual class itself:

package com.ikai.mapperdemo.mappers;

import java.util.Date;
import java.util.logging.Logger;

import org.apache.hadoop.io.NullWritable;

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
import com.google.appengine.tools.mapreduce.DatastoreMutationPool;

/**
 *
 * The functionality of this is exactly the same as in {@link NaiveToLowercaseMapper}.
 * The advantage here is that since a {@link DatastoreMutationPool} is used, mutations
 * can be done in batch, saving API calls.
 *
 * @author Ikai Lan
 *
 */
public class PooledToLowercaseMapper extends
		AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
	private static final Logger log = Logger
			.getLogger(PooledToLowercaseMapper.class.getName());

	@Override
	public void map(Key key, Entity value, Context context) {
		log.info("Mapping key: " + key);

		if (value.hasProperty("comment")) {
			String comment = (String) value.getProperty("comment");
			comment = comment.toLowerCase();
			value.setProperty("comment", comment);
			value.setProperty("updatedAt", new Date());

			DatastoreMutationPool mutationPool = this.getAppEngineContext(
					context).getMutationPool();
			mutationPool.put(value);
		}
	}
}

Everything looks exactly the same until these lines:

DatastoreMutationPool mutationPool = this.getAppEngineContext(context).getMutationPool();
mutationPool.put(value);

Aha! So we finally put the context to use. Granted, we use it as a parameter to another, more useful method, but at least we’re using it. We acquire a DatastoreMutationPool using the getAppEngineContext(context).getMutationPool() method, then we just call put() and pass the changed entity. DatastoreMutationPool is defined here and is open source.

The interface is similar to that of DatastoreService. There’s not a lot of fancy stuff going on here. put(), as we’ve seen, is defined. get() isn’t, because, well, that method makes no sense in this context. delete() is defined, which brings me to my bonus section:

Bonus Example 2: Delete all Entities of a given Kind

One of the most common questions asked in the group is, “How do I drop table?” Usually, this question is asked by new App Engine developers who don’t yet understand that the datastore is a distributed key-value store and not a relational database. But it’s also a legitimate use case. What if you just wanted to nuke all Entities of a given Kind? Prior to Mapper, you would have had to write your own handler to take care of this. Mapper makes this very easy. Here’s what a generic “DeleteAllMapper” would look like. This will work with *any* Entity Kind:

package com.ikai.mapperdemo.mappers;

import java.util.logging.Logger;

import org.apache.hadoop.io.NullWritable;

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
import com.google.appengine.tools.mapreduce.DatastoreMutationPool;

/**
 *
 * This Mapper deletes all Entities of a given kind. It simulates the
 * DROP TABLE functionality asked for by developers.
 *
 * @author Ikai Lan
 *
 */
public class DeleteAllMapper extends
		AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
	private static final Logger log = Logger.getLogger(DeleteAllMapper.class
			.getName());

	@Override
	public void map(Key key, Entity value, Context context) {
		log.info("Adding key to deletion pool: " + key);
		DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
				.getMutationPool();
		mutationPool.delete(value.getKey());
	}
}

That’s it! We wire it up the same way we wire up other Mappers:


  <configuration name="Delete all Entities">
    <property>
      <name>mapreduce.map.class</name>

      <!--  Set this to be your Mapper class  -->
      <value>com.ikai.mapperdemo.mappers.DeleteAllMapper</value>
    </property>

    <!--  This is a default tool that lets us iterate over datastore entities -->
    <property>
      <name>mapreduce.inputformat.class</name>
      <value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
    </property>

    <property>
      <name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
      <value template="optional">Comment</value>
    </property>
  </configuration>

I’ve separated each out into its own mapreduce.xml, but this isn’t necessary. A given App Engine project can have multiple configuration elements defined. That’s why there’s a dropdown list in the Mapreduce console GUI.

Example 3: Taking more user input in the Mapper console and counting

Our next example covers using counters in the context. Let’s say we wanted to allow the user to enter a string, then iterate over every Entity searching for occurrences of that substring on the fly, without pre-built indexes. First, let’s discuss the XML configuration we use:

  <configuration name="Count words in all Comments">
    <property>
      <name>mapreduce.map.class</name>

      <!--  Set this to be your Mapper class  -->
      <value>com.ikai.mapperdemo.mappers.CountWordsMapper</value>
    </property>

    <property>
    	<!--  This is the URL to call after the entire Mapper has run -->
    	<name>mapreduce.appengine.donecallback.url</name>
    	<value>/callbacks/word_count_completed</value>
    </property>

    <!--  This is a default tool that lets us iterate over datastore entities -->
    <property>
      <name>mapreduce.inputformat.class</name>
      <value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
    </property>

    <property>
      <name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
      <value template="optional">Comment</value>
    </property>

  </configuration>

There’s one new name/value pair:
Name: mapreduce.mapper.counter.substringtarget
Value: Substring
We can pick any name or value we want. We just pick this one because it makes sense. We’ll retrieve this value in the Mapper via the Context. This causes an extra text field to appear in the Mapper console:


The Mapper is below:

package com.ikai.mapperdemo.mappers;

import java.util.logging.Logger;

import org.apache.hadoop.io.NullWritable;

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;

/**
 *
 * This Mapper takes some input and counts the number of Comments which
 * contain that substring.
 *
 * @author Ikai Lan
 *
 */
public class SubstringMatcherMapper extends
		AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
	private static final Logger log = Logger.getLogger(SubstringMatcherMapper.class
			.getName());

	/*
	 * Get the target that we want to match on and count the number of Comments that
	 * match it
	 */
	@Override
	public void map(Key key, Entity value, Context context) {

	    String substringToMatch = context.getConfiguration().get("mapreduce.mapper.counter.substringtarget");

		String comment = (String) value.getProperty("comment");
		if (comment != null) {
			if(comment.contains(substringToMatch)) {
				log.info("Found match in: " + comment);
				context.getCounter("SubstringMatch", "count").increment(1);
			}
		}

	}
}

We retrieve the value entered by the user with this line of code:

context.getConfiguration().get("mapreduce.mapper.counter.substringtarget");

If the comment we’re currently working on contains the substring, we want to increment our count. The context object has a getCounter() method that returns a counter we can increment or decrement:

context.getCounter("SubstringMatch", "count").increment(1);

When our job completes running, we can see the total count when we click “Detail” on the completed job:

More likely than not, however, we’ll want to store this number back in the datastore or do something with it besides stick it into a status page. That’s exactly what the next example covers …

Example 4: Completion callbacks and JobContexts

Let’s modify Example 3 a bit. Suppose now we want to count the total number of words across all comments. We’ll need to use a counter. But suppose that instead of just displaying it on a console page, we want that number stored back into the datastore. Much like Task Queues, incoming email and XMPP, the callback is event driven, and it dispatches via an HTTP request to an application URI. That is – we’ll define a servlet with a doPost() handler and read the input out of the parameters.

The first thing we’ll need to do is configure our Mapper to fire off the callback when done. We do this in mapreduce.xml:

  <configuration name="Count substring matches in all Comments">
    <property>
      <name>mapreduce.map.class</name>

      <!--  Set this to be your Mapper class  -->
      <value>com.ikai.mapperdemo.mappers.SubstringMatcherMapper</value>
    </property>

    <!--  This is a default tool that lets us iterate over datastore entities -->
    <property>
      <name>mapreduce.inputformat.class</name>
      <value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
    </property>

    <property>
      <name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
      <value template="optional">Comment</value>
    </property>

    <property>
      <name human="Search for substring">mapreduce.mapper.counter.substringtarget</name>
      <value template="optional">Substring</value>
    </property>

  </configuration>

Here’s the property we care about:

Name: mapreduce.appengine.donecallback.url
Value: /callbacks/word_count_completed

The value of this can map to any URI in your application. Just be sure that URI points to the Servlet that will be handling your callback. Let’s define the Mapper class:

package com.ikai.mapperdemo.mappers;

import java.util.logging.Logger;

import org.apache.hadoop.io.NullWritable;

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;

/**
 *
 * This mapper counts the number of total words across all comments. It cheats a
 * bit by just splitting on whitespace and just using the length. This mapper
 * demonstrates use of counters as well as using a completion callback.
 *
 * @author Ikai Lan
 *
 */
public class CountWordsMapper extends
		AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
	private static final Logger log = Logger.getLogger(CountWordsMapper.class
			.getName());

	/*
	 * This is a bit of a lazy implementation more to prove a point than to
	 * actually be correct. Split on whitespace, count words
	 */
	@Override
	public void map(Key key, Entity value, Context context) {

		String comment = (String) value.getProperty("comment");
		if (comment != null) {
			String[] words = comment.split("\\s+");
			int wordCount = words.length;

			// Takes a "group" and a "counter"
			// We'll use these later to store the final count back in the
			// datastore
			context.getCounter("CommentWords", "count").increment(wordCount);
		}

	}
}

Not a lot that’s new here. We use the context again to store a counter. Note that we can increment by any value, not just 1.

Let’s take a look at what our servlet looks like that handles this callback:

package com.ikai.mapperdemo.servlets;

import java.io.IOException;
import java.util.Date;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.JobID;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.tools.mapreduce.MapReduceState;
import com.ikai.mapperdemo.mappers.CountWordsMapper;

/**
 * This is the servlet that takes care of any processing we have to do after we
 * have finished running {@link CountWordsMapper}.
 *
 * This is just a standard servlet - we can do anything we want here. We can use
 * any App Engine API such as email or XMPP, for instance, to notify an
 * administrator. We could also store a final state into the datastore - in
 * fact, that is what this example below does.
 *
 * @author Ikai Lan
 *
 */
@SuppressWarnings("serial")
public class WordCountCompletedCallbackServlet extends HttpServlet {

	public void doPost(HttpServletRequest req, HttpServletResponse resp)
			throws IOException {

		String jobIdName = req.getParameter("job_id");
		JobID jobId = JobID.forName(jobIdName);

		// A future iteration of this will likely contain a default
		// option if we don't care which DatastoreService instance we use.
		DatastoreService datastore = DatastoreServiceFactory
				.getDatastoreService();
		try {

			// We get the state back from the job_id parameter. The state is
			// serialized and stored in the datastore, so we pass an instance
			// of the datastore service.
			MapReduceState mrState = MapReduceState.getMapReduceStateFromJobID(
					datastore, jobId);

			// There's a bit of ceremony to get the actual counter. This
			// example is intentionally verbose for clarity. First we get all
			// the Counters, then the CounterGroup, then the Counter, then
			// finally the count.
			Counters counters = mrState.getCounters();
			CounterGroup counterGroup = counters.getGroup("CommentWords");
			Counter counter = counterGroup.findCounter("count");
			long wordCount = counter.getValue(); // Finally!

			// Let's create a special datastore Entity for this value so
			// we can reference it on the ViewComments page
			Entity totalCountEntity = new Entity("TotalWordCount",
					"total_word_count");
			totalCountEntity.setProperty("count", wordCount);

			// Now we timestamp this bad boy
			totalCountEntity.setProperty("updatedAt", new Date());
			datastore.put(totalCountEntity);

		} catch (EntityNotFoundException e) {
			throw new IOException("No datastore state");
		}

	}

}
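
As an aside, here’s a sketch (my own, not from the original post) of how this servlet might be mapped in web.xml so that the callback URI we configured above actually resolves to it. The servlet-name is arbitrary:

    <servlet>
        <servlet-name>WordCountCallback</servlet-name>
        <servlet-class>com.ikai.mapperdemo.servlets.WordCountCompletedCallbackServlet</servlet-class>
    </servlet>

    <servlet-mapping>
        <servlet-name>WordCountCallback</servlet-name>
        <url-pattern>/callbacks/word_count_completed</url-pattern>
    </servlet-mapping>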

The JobID comes as a String parameter. We get it like so:

String jobIdName = req.getParameter("job_id");
JobID jobId = JobID.forName(jobIdName);

Be aware of the imports used. Your IDE may import the wrong class, as there is a deprecated JobID and a non-deprecated version.
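
If I have the two straight, the deprecated one lives in the older org.apache.hadoop.mapred package; the import we want looks like this:

// The JobID from the newer Hadoop API - this is the one we want:
import org.apache.hadoop.mapreduce.JobID;

// Your IDE may suggest the older, deprecated class instead - avoid it:
// import org.apache.hadoop.mapred.JobID;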

Once you have the JobID, you use it to retrieve the MapReduceState:

MapReduceState mrState = MapReduceState.getMapReduceStateFromJobID(datastore, jobId);

From the MapReduceState object, we have to perform a bit of a ceremony to get what we want. We need to:

1. Fetch the Counters from the MapReduceState
2. Fetch the appropriate CounterGroup from the Counters object
3. Fetch the named Counter from the CounterGroup
4. Fetch the value from the Counter

Here’s what it looks like in code:

Counters counters = mrState.getCounters();
CounterGroup counterGroup = counters.getGroup("CommentWords");
Counter counter = counterGroup.findCounter("count");
long wordCount = counter.getValue();

We can now do what we want with this count. In our servlet example, we save it to a datastore Entity and use it later on.

Get the code

You’re undoubtedly ready to start playing with this thing. You’ve got everything you need to know. First, here’s the getting started page for appengine-mapreduce in Java:

Here’s my sample source code on GitHub.

Summary

So there you have it: an easy to use tool for mapping operations across entire Entity Kinds. There are still a lot of topics to cover, and we’ll likely explore them in a future article. For instance, I didn’t have a chance to cover building your own InputFormat class. We’re still hard at work extending this framework (such as the “Shuffle” and “Reduce” phases), so please post your feedback in the App Engine groups or file bugs in the issue tracker.

Written by Ikai Lan

July 9, 2010 at 3:35 pm

Posted in App Engine, Java

Using Asynchronous URLFetch on Java App Engine

with 9 comments

Developers building applications on top of Java App Engine can use the familiar java.net interface for making off-network calls. For simple requests, this should be more than sufficient. The low-level API, however, provides one feature not available in java.net: asynchronous URLFetch.

The low-level URLFetch API

So what does the low-level API look like? Something like this:

import java.io.IOException;
import java.net.URL;
import java.util.List;

import com.google.appengine.api.urlfetch.HTTPHeader;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

	protected void makeSyncRequest() {
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		try {
			URL url = new URL("http://someurl.com");
			HTTPResponse response = fetcher.fetch(url);

			byte[] content = response.getContent();

			// If redirects are followed, this returns the final URL we are redirected to
			URL finalUrl = response.getFinalUrl();

			// 200, 404, 500, etc.
			int responseCode = response.getResponseCode();
			List<HTTPHeader> headers = response.getHeaders();

			for (HTTPHeader header : headers) {
				String headerName = header.getName();
				String headerValue = header.getValue();
			}

		} catch (IOException e) {
			// new URL throws MalformedURLException, which is impossible for us here
		}
	}

The full Javadocs are here.

So it’s a bit different from the standard java.net interface, where we’d get back a reader and iterate line by line over the response. We’re also protected from a heap overflow because URLFetch is limited to 1MB responses.
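
For comparison, here’s a quick sketch (mine, not from the original post) of that line-by-line java.net style:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

	protected void makeJavaNetRequest() throws IOException {
		URL url = new URL("http://someurl.com");
		BufferedReader reader = new BufferedReader(
				new InputStreamReader(url.openStream()));
		String line;
		while ((line = reader.readLine()) != null) {
			// Process each line of the response as we read it
		}
		reader.close();
	}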

Asynchronous vs. Synchronous requests

Using java.net has the advantage of portability – you could build a standard fetcher that will work in any JVM environment, even those outside of App Engine. The tradeoff, of course, is that App Engine specific features won’t be present. The one killer feature of App Engine’s low-level API that isn’t present in java.net is asynchronous URLFetch. What is asynchronous fetch? Let’s make an analogy:

Let’s pretend you, like me at home, are on DSL and have a pretty pathetic downstream speed, and decide to check out a link sharing site like Digg. You browse to the front page and decide to open up every single link. You can do this synchronously or asynchronously.

Synchronously

Synchronously, you click link #1. Now you look at this page. When you are done looking at this page, you hit the back button and click link #2. Repeat until you have seen all the pages. Now, again, you are on DSL, so not only do you spend time reading each page, you also have to wait for each page to load before you can read it. The waiting can take a significant portion of your time. The total amount of time you sit in front of your computer is thus:

N = number of links
L = loading time per page
R = reading time per page

N * (L + R)

(Yes, before I wrote this equation, I thought that including some mathematical formulas in my blog post would make me look smarter, but as it turns out the equation is something comprehensible by 8-year-olds internationally/American public high school students)

Asynchronously

Using a tabbed browser, you control-click every single link on the page to open them in new tabs. Now before you go look at any of the pages, you decide to go to the kitchen and make a grilled cheese sandwich. When you are done with the sandwich, you come back to your computer and sit down, enjoying your nice, toasty sandwich while you read articles about Ron Paul and look at funny pictures of cats. How much time are you spending?

N = number of links
S = loading time for the slowest loading page
R = reading time per page
G = time to make a grilled cheese sandwich

MAX((N * R + G), (N * R + S))

Which takes longer, your DSL, or the time it takes you to make a grilled cheese sandwich? The point that I’m making here is that you can save time by parallelizing things. No, I know it’s not a perfect analogy, as downloading N pages in parallel consumes the same crappy DSL connection, but you get what I am trying to say. Hopefully. And maybe you are also in the mood for some cheese.

Asynchronous URLFetch in App Engine

So what would the URLFetch above look like asynchronously? Probably something like this:

import java.io.IOException;
import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

import com.google.appengine.api.urlfetch.HTTPHeader;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

	protected void makeAsyncRequest() {
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		try {
			URL url = new URL("http://someurl.com");
			Future<HTTPResponse> future = fetcher.fetchAsync(url);

			// Other stuff can happen here!

			HTTPResponse response = future.get();
			byte[] content = response.getContent();
			URL finalUrl = response.getFinalUrl();
			int responseCode = response.getResponseCode();
			List<HTTPHeader> headers = response.getHeaders();

			for(HTTPHeader header : headers) {
				String headerName = header.getName();
				String headerValue = header.getValue();
			}

		} catch (IOException e) {
			// new URL throws MalformedURLException, which is impossible for us here
		} catch (InterruptedException e) {
			// Exception from using java.util.concurrent.Future
		} catch (ExecutionException e) {
			// Exception from using java.util.concurrent.Future
			e.printStackTrace();
		}

	}

This looks pretty similar – EXCEPT: fetchAsync doesn’t return an HTTPResponse. It returns a Future. What is a future?

java.util.concurrent.Future

From the Javadocs:

“A Future represents the result of an asynchronous computation. Methods are provided to check if the computation is complete, to wait for its completion, and to retrieve the result of the computation. The result can only be retrieved using method get when the computation has completed, blocking if necessary until it is ready. Cancellation is performed by the cancel method. Additional methods are provided to determine if the task completed normally or was cancelled.”

What does this mean in English? It means that the Future object is NOT the response of the HTTP call. We can’t get the response until we call the get() method. Between when we call fetchAsync() and when we call get(), we can do other stuff: datastore operations, inserting things into the Task Queue, heck, we can even do more URLFetch operations. When we finally DO call get(), one of two things happens:

  1. We’ve already retrieved the URL. Return an HTTPResponse object
  2. We’re still retrieving the URL. Block until we are done, then return an HTTPResponse object.

In the best case scenario, the other work takes at least as long as the URLFetch, and the fetch costs us no additional wall-clock time at all. In the worst case scenario, we wait for whichever takes longer: the URLFetch or the other operations. Either way, the end result is that we lower the amount of time it takes to return a response to the end-user.
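
Here’s a tiny sketch of that interleaving (my own illustration; datastore and someEntity are hypothetical and assumed to be set up elsewhere):

Future<HTTPResponse> future = fetcher.fetchAsync(url);

// The fetch is now in flight. Do unrelated work while we wait,
// such as a datastore put:
datastore.put(someEntity);

// This only blocks if the fetch hasn't finished yet:
HTTPResponse response = future.get();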

Twitter Example

So let’s build a servlet that retrieves my tweets. Just for giggles, let’s do it 20 times and see what the performance difference is. We’ll make it so that if we pass a URL parameter, async=true (or anything, for simplicity), we do the same operation using fetchAsync. The code is below:

package com.ikai.urlfetchdemo;

import java.io.IOException;
import java.io.PrintWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

@SuppressWarnings("serial")
public class GetTwitterFeedServlet extends HttpServlet {

	protected static String IKAI_TWITTER_RSS = "http://twitter.com/statuses/user_timeline/14437022.rss";

	public void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws IOException {

		boolean isSyncRequest = true;

		if(req.getParameter("async") != null) {
			isSyncRequest = false;
		}

		resp.setContentType("text/html");
		PrintWriter out = resp.getWriter();
		out.println("<h1>Twitter feed fetch demo</h1>");

		long startTime = System.currentTimeMillis();
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		URL url = new URL(IKAI_TWITTER_RSS);

		if(isSyncRequest) {
			out.println("<h2>Synchronous fetch</h2>");
			for(int i = 0; i < 20; i++) {
				HTTPResponse response = fetcher.fetch(url);
				printResponse(out, response);
			}
		} else {
			out.println("<h2>Asynchronous fetch</h2>");
			ArrayList<Future<HTTPResponse>> asyncResponses = new ArrayList<Future<HTTPResponse>>();
			for(int i = 0; i < 20; i++) {
				Future<HTTPResponse> responseFuture = fetcher.fetchAsync(url);
				asyncResponses.add(responseFuture);
			}

			for(Future<HTTPResponse> future : asyncResponses){
				HTTPResponse response;
				try {
					response = future.get();
					printResponse(out, response);
				} catch (InterruptedException e) {
					// Guess you would do something here
				} catch (ExecutionException e) {
					// Guess you would do something here
				}
			}

		}

		long totalProcessingTime = System.currentTimeMillis() - startTime;
		out.println("<p>Total processing time: " + totalProcessingTime + "ms</p>");
	}

	private void printResponse(PrintWriter out, HTTPResponse response) {
		out.println("<p>");
		out.println("Response: " + new String(response.getContent()));
		out.println("</p>");
	}

}

As you can see, it’s a bit more involved: we store all the Futures in a list, then iterate through them. We’re also not being too intelligent about iterating through the Futures: we’re assuming first-in-first-out (FIFO) with URLFetch, which may or may not be the case in production. A more optimized approach might consume whichever response is ready instead of blocking on a response we know will be slow – see the sketch below. Even so, empirical testing will show that more often than not, doing things asynchronously is significantly faster for the user than doing them synchronously.
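
Here’s what such an optimization might look like (a sketch of my own, not part of the demo): poll the Futures with isDone() and consume whichever responses are ready. Note that this busy-waits, so it burns CPU while it spins, and it needs java.util.Iterator imported alongside the imports already in the servlet:

List<Future<HTTPResponse>> pending = new ArrayList<Future<HTTPResponse>>(asyncResponses);
while (!pending.isEmpty()) {
	Iterator<Future<HTTPResponse>> it = pending.iterator();
	while (it.hasNext()) {
		Future<HTTPResponse> future = it.next();
		if (future.isDone()) {
			try {
				// Won't block: isDone() told us this response is ready
				printResponse(out, future.get());
			} catch (InterruptedException e) {
				// Guess you would do something here
			} catch (ExecutionException e) {
				// Guess you would do something here
			}
			it.remove();
		}
	}
}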

Using Asynchronous URLFetch and HTTP POST

So far, our examples have been focused on read operations. What if we don’t care about the response? For instance, what if we decide to make an API call that is essentially a “write” operation, and can, for the most part, safely assume it will succeed?

// JavaAsyncUrlFetchDemoServlet.java
package com.ikai.urlfetchdemo;

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

@SuppressWarnings("serial")
public class JavaAsyncUrlFetchDemoServlet extends HttpServlet {

	public void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws IOException {

		long startTime = System.currentTimeMillis();
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		fetcher.fetchAsync(FetchHelper.makeGuestbookPostRequest("Async", "At" + startTime));
		long totalProcessingTime = System.currentTimeMillis() - startTime;

		resp.setContentType("text/html");
		resp.getWriter().println("<h1>Asynchronous fetch demo</h1>");
		resp.getWriter().println("<p>Total processing time: " + totalProcessingTime + "ms</p>");
	}

}
// FetchHelper.java
package com.ikai.urlfetchdemo;

import java.net.MalformedURLException;
import java.net.URL;

import com.google.appengine.api.urlfetch.HTTPMethod;
import com.google.appengine.api.urlfetch.HTTPRequest;

public class FetchHelper {

	protected static final String signGuestBookUrl = "http://bootcamp-demo.appspot.com/sign";

	public static HTTPRequest makeGuestbookPostRequest(String name, String content){
		HTTPRequest request = null;
		URL url;
		try {
			url = new URL(signGuestBookUrl);
			request = new HTTPRequest(url, HTTPMethod.POST);
			String body = "name=" + name + "&amp;content=" + content;
			request.setPayload(body.getBytes());

		} catch (MalformedURLException e) {
			// Do nothing
		}
		return request;
	}
}

I’ve decided to spam my own guestbook here, for better or for worse.

Download the code

You can download the code from this post here using git: http://github.com/ikai/Java-App-Engine-Async-URLFetch-Demo

Written by Ikai Lan

June 29, 2010 at 2:49 pm

Posted in Uncategorized

Using the bulkloader with Java App Engine

with 32 comments

The latest release of the datastore bulkloader greatly simplifies import and export of data from App Engine applications. We’ll go through a step by step example of using this tool with a Java application. Note that only setting up Remote API is Java specific – everything else can be used with Python applications as well. Unlike certain phone companies, this is one store that doesn’t care what language your application is written in.

Checking for our Prerequisites:

If you already have Python 2.5.x and the Python SDK installed, skip this section.

First off, we’ll need to download the Python SDK. This example assumes we have Python version 2.5.x installed. If you’re not sure what version you have installed, open up a terminal and type “python”. This opens up a Python REPL, with the first line displaying the version of Python you’re using. Here’s example output:

Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

(Yes, Pythonistas, the version on my laptop is ooooooooold).

Download the Python SDK from the following link. As of the writing of this post, the newest version is 1.3.4: Direct link.

Unzip this file. It’ll be easier for you if you place this in your path. Linux and OS X users will append this in their ~/.bash_profile:

PATH="/path/to/where/you/unzipped/appengine:${PATH}"
export PATH

To test that everything is working, type

appcfg.py

You’ll see a page of usage commands that starts out like this:

Usage: appcfg.py [options] <action>

Action must be one of:
create_bulkloader_config: Create a bulkloader.yaml from a running application.
cron_info: Display information about cron jobs.
download_data: Download entities from datastore.
help: Print help for a specific action.
request_logs: Write request logs in Apache common log format.
rollback: Rollback an in-progress update.
update: Create or update an app version.
update_cron: Update application cron definitions.
update_dos: Update application dos definitions.
update_indexes: Update application indexes.
update_queues: Update application task queue definitions.
upload_data: Upload data records to datastore.
vacuum_indexes: Delete unused indexes from application.
Use 'help <action>' for a detailed description.

…. (and so forth)

Now we can go ahead and start using the bulkloader.

Using the bulkloader for Java applications

Before we can begin using the bulkloader, we’ll have to set it up first. Setting up the bulkloader is a two step process. We’ll need to:

1. Add RemoteApi to mapping
2. Generate a bulkloader configuration

Step 1: Add RemoteApi to our URI mapping

We’ll want to edit our web.xml. Add the following lines:

    <servlet>
        <servlet-name>RemoteApi</servlet-name>
        <servlet-class>com.google.apphosting.utils.remoteapi.RemoteApiServlet</servlet-class>
    </servlet>

    <servlet-mapping>
        <servlet-name>RemoteApi</servlet-name>
        <url-pattern>/remote_api</url-pattern>
    </servlet-mapping>

A common pitfall when setting up RemoteApi is that developers using frameworks will use a catch-all expression for mapping URIs, and this will stomp on our servlet-mapping. Deploy this application to production. We’ll likely want to put an admin constraint on this URL, as shown below.
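
Here’s what such a constraint might look like in web.xml – a sketch, using the standard security-constraint element that App Engine honors for admin-only URLs:

    <security-constraint>
        <web-resource-collection>
            <web-resource-name>RemoteApi</web-resource-name>
            <url-pattern>/remote_api</url-pattern>
        </web-resource-collection>
        <auth-constraint>
            <role-name>admin</role-name>
        </auth-constraint>
    </security-constraint>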

Step 2: Generate a bulkloader configuration

This step isn’t actually *required*, but it certainly makes our lives easier, especially if we are looking to export existing data. In a brand new application, if we are looking to bootstrap our application with data, we don’t need this step at all. For completeness, however, it’d be best to go over it.

We’ll need to generate a configuration template. This step depends on datastore statistics having been updated with the Entities we’re looking to export. Log in to appspot.com and click “Datastore Statistics” under Datastore in the right hand menu.

If we see something that looks like the following screenshot, we can use this tool.

If we see something that looks like the screenshot below, then we can’t autogenerate a configuration since this is a brand new application – that’s okay, that means we probably don’t have much data to export. We’ll have to wait for App Engine’s background tasks to bulk update our statistics before we’ll be able to complete this step.

Assuming that we have datastore statistics available, we can use appcfg.py in the following manner to generate a configuration file:

appcfg.py create_bulkloader_config --url=http://APPID.appspot.com/remote_api --application=APPID --filename=config.yml

If the datastore statistics aren’t ready, running this command will cause the following error:

[ERROR   ] Unable to download kind stats for all-kinds download.
[ERROR   ] Kind stats are generated periodically by the appserver
[ERROR   ] Kind stats are not available on dev_appserver.

I’m using this on a Guestbook sample application I wrote for a codelab a while ago. The only Entities are Greetings, each of which consists of a String name, a String content and a timestamp. This is what my config file looks like:

# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
#  * Fill in connector and connector_options
#  * Review the property_map.
#    - Ensure the 'external_name' matches the name of your CSV column,
#      XML tag, etc.
#    - Check that __key__ property is what you want. Its value will become
#      the key name on import, and on export the value will be the Key
#      object.  If you would like automatic key generation on import and
#      omitting the key on export, you can remove the entire __key__
#      property from the property map.

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Greeting
  connector:

  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      import_transform: transform.key_id_or_name_as_string

    - property: content
      external_name: content
      # Type: String Stats: 7 properties of this type in this kind.

    - property: date
      external_name: date
      # Type: Date/Time Stats: 7 properties of this type in this kind.
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')

    - property: name
      external_name: name
      # Type: String Stats: 7 properties of this type in this kind.

We care about the connector. Replace that with the following:

- kind: Greeting
  connector: csv

We’ve only filled in the “connector” option. Now we have something we can use to dump data.

Examples of common usages of the bulkloader

Downloading data

We’ve got what we need to dump data. Let’s go ahead and do that now. Issue the following command:

appcfg.py download_data --config_file=config.yml --filename=data.csv --kind=Greeting --url=http://APPID.appspot.com/remote_api --application=APPID

We’ll be asked to provide our email and password credentials. Here’s what my console output looks like:

Downloading data records.
[INFO    ] Logging to bulkloader-log-20100609.162353
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 10
[INFO    ] Opening database: bulkloader-progress-20100609.162353.sql3
[INFO    ] Opening database: bulkloader-results-20100609.162353.sql3
[INFO    ] Connecting to java.latest.bootcamp-demo.appspot.com/remote_api
2010-06-09 16:23:57,022 WARNING appengine_rpc.py:399 ssl module not found.
Without the ssl module, the identity of the remote host cannot be verified, and
connections may NOT be secure. To fix this, please install the ssl module from
http://pypi.python.org/pypi/ssl .
To learn more, see http://code.google.com/appengine/kb/general.html#rpcssl .
Please enter login credentials for java.latest.bootcamp-demo.appspot.com
Email: YOUR EMAIL
Password for YOUR_EMAIL:
[INFO    ] Downloading kinds: ['Greeting']
.[INFO    ] Greeting: No descending index on __key__, performing serial download
.
[INFO    ] Have 17 entities, 0 previously transferred
[INFO    ] 17 entities (11304 bytes) transferred in 10.5 seconds

There’s now a CSV file named data.csv in my directory, as well as a bunch of autogenerated bulkloader-* files for resuming if the loader dies midway during the export. My CSV file starts like this:

content,date,name,key
Hey it works!,2010-05-18T22:35:17,Ikai Lan,1
… (More lines of CSV)

The first line is a labeling line – this line designates the order in which properties have been exported. In our case, we’ve exported content, date and name in addition to Entity keys.

Uploading Data

To upload the CSV file back into the datastore, we run the following command:

appcfg.py upload_data --config_file=config.yml --filename=data.csv --url=http://APPID.appspot.com/remote_api --application=APPID --kind=Greeting

This’ll use config.yml and create our entities in the remote datastore.

Adding a new field to datastore entities

One question that is frequently asked in the groups is, “How do I migrate my schema?” This question is generally poorly phrased; App Engine’s datastore is schemaless. That is – it is possible to have Entities of the same Kind with completely different sets of properties. Most of the time, this is a good thing. MySQL, for instance, requires a table lock to do a schema update. By being schema free, migrations can happen lazily, and application developers can check at runtime for whether a Property exists on a given Entity, then create or set the value as needed.
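
A lazy migration like the one just described might look something like this with the low-level datastore API (a sketch of my own; someKey, the property name and the default value are all stand-ins):

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
try {
	Entity entity = datastore.get(someKey);
	if (!entity.hasProperty("newProperty")) {
		// Older Entity from before the property existed: backfill a default
		entity.setProperty("newProperty", "defaultValue");
		datastore.put(entity);
	}
} catch (EntityNotFoundException e) {
	// No Entity exists for this Key
}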

But there are times when this isn’t sufficient. One use case is if we want to change a default value on Entities and grandfather older Entities to the new default value, but we also want the default value to possibly be null. We can do tricks such as creating a new Property, setting an update timestamp, checking whether the update timestamp is before or after we made the code change and updating conditionally, and so forth. The problem with this approach is that it introduces a TON of complexity into our application, and if we have more than one of these “migrations”, suddenly we’re writing more code to lazily grandfather data and confusing the non-Cylons that work on our team. It’s easier to migrate all the data. So how do we do this? Before the new application code goes live, we migrate the schema by adding the new field. The best part is that we can do this without locking tables, so writes can continue.

Let’s add a new String field to our Greeting class: homepageUrl. Let’s assume that we want to set a default to http://www.google.com. How would we do this? Let’s update our config.yml file to the following:

# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
#  * Fill in connector and connector_options
#  * Review the property_map.
#    - Ensure the 'external_name' matches the name of your CSV column,
#      XML tag, etc.
#    - Check that __key__ property is what you want. Its value will become
#      the key name on import, and on export the value will be the Key
#      object.  If you would like automatic key generation on import and
#      omitting the key on export, you can remove the entire __key__
#      property from the property map.

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Greeting
  connector: csv

  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      import_transform: transform.key_id_or_name_as_string

    - property: content
      external_name: content
      # Type: String Stats: 7 properties of this type in this kind.

    - property: homepageUrl
      external_name: homepageUrl

    - property: date
      external_name: date
      # Type: Date/Time Stats: 7 properties of this type in this kind.
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')

    - property: name
      external_name: name
      # Type: String Stats: 7 properties of this type in this kind.

Note that we’ve added a new property with a new external_name. By default, the loader will use a String.

Now let’s add the field to our CSV file:

content,date,name,key,homepageUrl
Hey it works!,2010-05-18T22:35:17,Ikai Lan,1,http://www.google.com
... (more lines)

We’d likely write a script to augment our CSV file. Note that this only works if we have named keys! If we had integer keys before, we’ll end up creating duplicate entities using key names and not integer IDs.

Now we run the bulkloader to upload our entities:

appcfg.py upload_data --config_file=config.yml --filename=data.csv --url=http://APPID.appspot.com/remote_api --application=APPID --kind=Greeting

Once our loader has finished running, we’ll see the new fields on our existing entities.

WARNING: There is a potential race condition here: if an Entity gets updated by our bulkloader in this fashion right as user facing code reads and updates the Entity without the new field, that will leave us with Entities that were grandfathered incorrectly. Fortunately, after we migrate, we can do a query for these Entities and manually update them. It’s slightly annoying, but far less painful than making bulkloader updates transactional.

Bootstrapping the datastore with default Entities

So we’ve covered the use case of using a generated config.yml file to update or load entities into the datastore, but what we haven’t yet covered is bootstrapping a completely new Entity Kind with never before seen data into the datastore.

Let’s add a new Entity Kind, Employee, to our datastore. We’ll preload this data:

name,title
Ikai Lan,Developer Programs Engineer
Patrick Chanezon,Developer Advocate
Wesley Chun,Developer Programs Engineer
Nick Johnson,Developer Programs Engineer
Jason Cooper,Developer Programs Engineer
Christian Schalk,Developer Advocate
Fred Sauer,Developer Advocate

Note that we didn’t add a key. In this case, we don’t care, so it simplifies our config files. Now let’s take a look at the config file we need to use – I’ve saved mine as new_entity.yml, to match the new_entity.csv data file used in the upload command below:

python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Employee
  connector: csv

  property_map:

    - property: name
      external_name: name

    - property: title
      external_name: title

Now let’s go ahead and upload these entities:

$ appcfg.py upload_data --config_file=new_entity.yml --filename=new_entity.csv  --url=http://APPID.appspot.com/remote_api --kind=Employee
Uploading data records.
[INFO    ] Logging to bulkloader-log-20100610.151326
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 10
[INFO    ] Opening database: bulkloader-progress-20100610.151326.sql3
[INFO    ] Connecting to APPID.appspot.com/remote_api
2010-06-10 15:13:27,334 WARNING appengine_rpc.py:399 ssl module not found.
Without the ssl module, the identity of the remote host cannot be verified, and
connections may NOT be secure. To fix this, please install the ssl module from
http://pypi.python.org/pypi/ssl .
To learn more, see http://code.google.com/appengine/kb/general.html#rpcssl .
Please enter login credentials for APPID.appspot.com
Email: your.email@gmail.com
Password for your.email@gmail.com:
[INFO    ] Starting import; maximum 10 entities per post
.
[INFO    ] 7 entites total, 0 previously transferred
[INFO    ] 7 entities (5394 bytes) transferred in 8.6 seconds
[INFO    ] All entities successfully transferred

Boom! We’re done.

There are still a lot of bulkloader topics to discuss – related entities, entity groups, keys, and so forth. Stay tuned.

Written by Ikai Lan

June 10, 2010 at 2:52 pm

Posted in Uncategorized

Introduction to working with App Engine’s low-level datastore API

with 9 comments

App Engine’s Java SDK ships with three different mechanisms for persisting data:

  • JPA – the javax.persistence.* package
  • JDO – Java Data Objects
  • The low-level API

The formal documentation has some good examples for working with JDO and JPA, but the documentation for working with the low-level API is still a tad sparse. The original purpose of the low-level API was to give developers a way to build libraries that could do persistence, or even build persistence libraries themselves – alternative persistence mechanisms such as Objectify, Twig, SimpleDS and Slim3 all build on top of this API.

For most developers, it may be simpler to use either JDO, JPA or a third-party library, but there are cases in which the low-level API is useful. This post will be a beginner’s guide to writing and retrieving data using this API – we’ll save more advanced topics for future posts.

For those newer to App Engine, let’s define a few terms before we continue:

Entity – An entity is an object representation of a datastore row. Unlike a row in a relational database, there are no predefined columns. There’s only one giant Bigtable, and your entities are all part of that table.

Entity Kind – There are no tables corresponding to types of data. The Kind of the entity is stored as part of the Entity Key.

Entity Key – The primary way by which entities are fetched; even when you issue queries, the datastore ultimately does a batch get of entities by Key. It’s similar to a primary key in an RDBMS. The Key encodes your application ID, your Entity’s Kind, any parent entities and other metadata. A full description of the Key is out of scope for this article, but you’ll find plenty of content about Keys in your favorite search engine.

Properties – Entities don’t have columns – they have properties. A property represents a field of your Entity. For instance, a Person Entity would have a Kind of Person, a Key built from a unique identifier such as their name (for all real world scenarios, this is only true for me, as I’m the only Ikai Lan in the world), and Properties: age, height, weight, eye color, etc.

There are a lot more terms, but these are the ones we’ll be using frequently in this article. Let’s describe a few key features of the low-level API which differ from using a higher level tool such as the JDO and JPA interfaces. Depending on your point of view, these could be either advantages or disadvantages:

  • Typeless entities. Think of an Entity as a Java class with a Key (a datastore Key), a Kind (a String) and Properties (a Map of property names to values). This means that for a given entity kind, it is possible to have two different entities with completely different properties. You could have a Person entity that defines age and weight as its properties, and a separate Person entity that defines height and eye color.
  • No Managers. You instantiate a DatastoreService from a DatastoreServiceFactory, then you get(), put() and query()*. No need to worry about opening or closing managers, detaching objects, marking items dirty, and so forth.
  • Lower startup time. For lower traffic Java websites, loading a PersistenceManagerFactory or EntityManagerFactory can incur additional startup time cost.

We’ll cover queries in a future post; here we’ll just use get() and put(), treating App Engine’s datastore as if it were just a giant Map. Frankly, this isn’t a bad simplification – at its lowest level, Bigtable is a key-value store, which means the Map abstraction isn’t too far from reality.

Let’s create two Entities representing Persons. We’ll name them Alice and Bob. Let’s define them now:

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

Entity alice = new Entity("Alice", "Person");
alice.setProperty("gender", "female");
alice.setProperty("age", 20);

Key bobKey = KeyFactory.createKey("Person", "Bob");
Entity bob = new Entity(bobKey);
bob.setProperty("gender", "male");
bob.setProperty("age", "23");

What we’ve demonstrated here are two of the basic ways to create entities. Entity contains five constructors. We’re just demonstrating two of them here.

We’re defining Alice with a raw constructor. We’re passing two Strings: her kind as well as her key name. As we mentioned before – Entities are typeless, and we can specify just about any String as her kind. Effectively, this means that the number of kinds we can have is limited only by the number of kinds that we need, and as long as we don’t lose track of them, we could potentially have hundreds of different kinds without having to create a class for each one. We could even define new kinds at runtime, if we so dared. The key name is what we’ll use to retrieve Alice later on when we need her again. Think of it as a Map or Dictionary key. Once we have an Entity object, we need to define her properties. For now, we’ll define her gender and her age. Note that, again, Properties behave like Maps, and this means that not only can Entities have hundreds of types of different properties, we could also create new properties at runtime at the expense of compiler type-safety. Choose your poison carefully.

We’re creating Bob’s instance a bit differently, but not too differently. Using KeyFactory’s static createKey method, we create a Key instance. Note the constructor arguments – they are exactly the same: a kind and a key name. In our simple example, this doesn’t really give us any additional benefits except for adding an additional line of code, but more advanced usages in which we may want to create an Entity with a parent, this technique may result in more clear code. And again – we set Bob’s properties using something similar to a Map.

If you’ve been reading Entity’s Javadoc or following along in your IDE, you’ve probably realized by now that Entity does not contain setKey() or setKind() methods. This is because an Entity’s key is immutable. Once an Entity has a key, it can never be changed. You cannot retrieve an Entity from the datastore and change its key – you must create a new Entity with a new Key and delete the old Entity. This is also true of Entities instantiated in local memory.
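
To make that concrete, here’s a sketch (my own) of what “changing” a key actually entails, using Entity’s setPropertiesFrom() method to copy every property onto the replacement Entity. It uses the DatastoreService client we’ll introduce in just a moment:

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
try {
    Entity oldBob = datastore.get(KeyFactory.createKey("Person", "Bob"));

    // Keys are immutable, so "renaming" Bob means creating a new Entity...
    Entity renamedBob = new Entity("Person", "Robert");
    renamedBob.setPropertiesFrom(oldBob);
    datastore.put(renamedBob);

    // ... and deleting the old one
    datastore.delete(oldBob.getKey());
} catch (EntityNotFoundException e) {
    // Bob doesn't exist!
}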

Speaking of unsaved Entities, let’s go ahead and save them now. We’ll create an instance of the Datastore client and save Alice and Bob:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

Entity alice = new Entity("Person", "Alice");
alice.setProperty("gender", "female");
alice.setProperty("age", 20);

Key bobKey = KeyFactory.createKey("Person", "Bob");
Entity bob = new Entity(bobKey);
bob.setProperty("gender", "male");
bob.setProperty("age", "23");

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
datastore.put(alice);
datastore.put(bob);

That’s it! DatastoreService’s put() method returns a Key that we can use.
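
For instance (a trivial sketch of my own), we could capture the return value:

Key savedAliceKey = datastore.put(alice);
Key savedBobKey = datastore.put(bob);

In the next example, since we’re in another class, we’ll rebuild the Keys with KeyFactory instead.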

Now let’s demonstrate retrieving Alice and Bob by Key from another class:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

Key bobKey = KeyFactory.createKey("Person", "Bob");
Key aliceKey = KeyFactory.createKey("Person", "Alice");

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Entity alice, bob;

try {
    alice = datastore.get(aliceKey);
    bob = datastore.get(bobKey);

    Long aliceAge = (Long) alice.getProperty("age");
    Long bobAge = (Long) bob.getProperty("age");
    System.out.println("Alice's age: " + aliceAge);
    System.out.println("Bob's age: " + bobAge);
} catch (EntityNotFoundException e) {
    // Alice or Bob doesn't exist!
}

The DatastoreService instance’s get() method takes a Key parameter; this is the same parameter we used earlier to construct the Entity representing Bob! This method throws an EntityNotFoundException if no Entity exists for the given Key. We retrieve individual properties using the Entity class’s getProperty() method – in the case of age, we cast this to a Long, since the datastore stores integral values as Longs.

So there you have it: the basics of working with the low-level API. I’ll likely add more articles in the future about queries, transactions, and more advanced things you can do.

Written by Ikai Lan

June 3, 2010 at 6:46 pm

JRuby In-Memory Search Example With Lucene 3.0.1

with 2 comments

Just for giggles I decided to port the In-Memory search example from my last blog post to JRuby. It’s been some time since I’ve used JRuby for anything, but the team has still been hard at work making strides towards better Java interoperability and ease of use. I downloaded JRuby 1.5.0_RC1, pointed my PATH to the /bin directory, and began hacking.

I’m incredibly impressed with the level of Java interop and startup speed improvements. Kudos to the JRuby team. Integrating Java couldn’t have been easier.

The example is below. Run it with the command:


jruby -r /path/to/lucene-core-3.0.1.jar inmemory.rb

require 'java'
# You can either require the JAR file directly by uncommenting the line
# below, or pass the -r flag to JRuby as follows:
# jruby -r /path/to/lucene-core-3.0.1.jar inmemory.rb
# require "lucene-core-3.0.1.jar"

java_import org.apache.lucene.analysis.standard.StandardAnalyzer
java_import org.apache.lucene.document.Document
java_import org.apache.lucene.document.Field
java_import org.apache.lucene.index.IndexWriter
java_import org.apache.lucene.queryParser.ParseException
java_import org.apache.lucene.queryParser.QueryParser
java_import org.apache.lucene.store.RAMDirectory
java_import org.apache.lucene.util.Version

java_import org.apache.lucene.search.IndexSearcher
java_import org.apache.lucene.search.TopScoreDocCollector


def create_document(title, content)
  doc = Document.new
  doc.add Field.new("title", title, Field::Store::YES, Field::Index::NO)
  doc.add Field.new("content", content, Field::Store::YES, Field::Index::ANALYZED)  
  doc
end

def create_index
  idx     = RAMDirectory.new
  writer  = IndexWriter.new(idx, StandardAnalyzer.new(Version::LUCENE_30), IndexWriter::MaxFieldLength::LIMITED)

  writer.add_document(create_document("Theodore Roosevelt",
          "It behooves every man to remember that the work of the " +
                  "critic, is of altogether secondary importance, and that, " +
                  "in the end, progress is accomplished by the man who does " +
                  "things."))
  writer.add_document(create_document("Friedrich Hayek",
          "The case for individual freedom rests largely on the " +
                  "recognition of the inevitable and universal ignorance " +
                  "of all of us concerning a great many of the factors on " +
                  "which the achievements of our ends and welfare depend."))
  writer.add_document(create_document("Ayn Rand",
          "There is nothing to take a man's freedom away from " +
                  "him, save other men. To be free, a man must be free " +
                  "of his brothers."))
  writer.add_document(create_document("Mohandas Gandhi",
          "Freedom is not worth having if it does not connote " +
                  "freedom to err."))

  writer.optimize
  writer.close
  idx
end

def search(searcher, query_string)
  parser = QueryParser.new(Version::LUCENE_30, "content", StandardAnalyzer.new(Version::LUCENE_30))
  query = parser.parse(query_string)
  
  hits_per_page = 10
  
  collector = TopScoreDocCollector.create(5 * hits_per_page, false)
  searcher.search(query, collector)
  
  # Notice how this differs from the Java version: JRuby automagically translates
  # underscore_case_methods into CamelCaseMethods, but scoreDocs is not a method:
  # it's a field. That's why we have to use CamelCase here, otherwise JRuby would
  # complain that score_docs is an undefined method.
  hits = collector.top_docs.scoreDocs
  
  hit_count = collector.get_total_hits
    
  if hit_count.zero?
    puts "No matching documents."
  else
    puts "%d total matching documents" % hit_count
    
    puts "Hits for %s were found in quotes by:" % query_string
    
    hits.each_with_index do |score_doc, i|
      doc_id = score_doc.doc
      doc_score = score_doc.score
      
      puts "doc_id: %s \t score: %s" % [doc_id, doc_score]
      
      doc = searcher.doc(doc_id)
      puts "%d. %s" % [i, doc.get("title")]
      puts "Content: %s" % doc.get("content")
      puts
      
    end
    
  end

end

def main
  index = create_index
  searcher = IndexSearcher.new(index)

  search(searcher, "freedom")
  search(searcher, "free");
  search(searcher, "progress or achievements");
  search(searcher, "ikaisays.com")

  searcher.close
end

main()

Written by Ikai Lan

April 25, 2010 at 7:49 pm

Posted in JRuby, Ruby, Software Development


Lucene In-Memory Search Example: Now updated for Lucene 3.0.1

with 3 comments

Update: Here’s a link to some sample code for Python using PyLucene. Thanks, Joseph!

While playing around with Lucene in my experiments to make it work with Google App Engine, I found an excellent example for indexing some text using Lucene in-memory; unfortunately, it dates back to May 2004 (!!!). I’ve updated the example to work with the newest version of Lucene, 3.0.1. It’s below for reference.

The Pastie link for the code snippet can be found here.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import java.io.IOException;

public class LuceneTest{

   public static void main(String[] args) {
      // Construct a RAMDirectory to hold the in-memory representation
      // of the index.
      RAMDirectory idx = new RAMDirectory();

      try {
         // Make a writer to create the index
         IndexWriter writer =
                 new IndexWriter(idx, 
                         new StandardAnalyzer(Version.LUCENE_30), 
                         IndexWriter.MaxFieldLength.LIMITED);

         // Add some Document objects containing quotes
         writer.addDocument(createDocument("Theodore Roosevelt",
                 "It behooves every man to remember that the work of the " +
                         "critic, is of altogether secondary importance, and that, " +
                         "in the end, progress is accomplished by the man who does " +
                         "things."));
         writer.addDocument(createDocument("Friedrich Hayek",
                 "The case for individual freedom rests largely on the " +
                         "recognition of the inevitable and universal ignorance " +
                         "of all of us concerning a great many of the factors on " +
                         "which the achievements of our ends and welfare depend."));
         writer.addDocument(createDocument("Ayn Rand",
                 "There is nothing to take a man's freedom away from " +
                         "him, save other men. To be free, a man must be free " +
                         "of his brothers."));
         writer.addDocument(createDocument("Mohandas Gandhi",
                 "Freedom is not worth having if it does not connote " +
                         "freedom to err."));

         // Optimize and close the writer to finish building the index
         writer.optimize();
         writer.close();

         // Build an IndexSearcher using the in-memory index
         Searcher searcher = new IndexSearcher(idx);

         // Run some queries
         search(searcher, "freedom");
         search(searcher, "free");
         search(searcher, "progress or achievements");

         searcher.close();
      }
      catch (IOException ioe) {
         // In this example we aren't really doing any I/O, so this
         // exception should never actually be thrown.
         ioe.printStackTrace();
      }
      catch (ParseException pe) {
         pe.printStackTrace();
      }
   }

   /**
    * Make a Document object with an un-indexed title field and an
    * indexed content field.
    */
   private static Document createDocument(String title, String content) {
      Document doc = new Document();

      // Add the title as an unindexed field...

      doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));


      // ...and the content as an indexed field. Note that indexed
      // Text fields are constructed using a Reader. Lucene can read
      // and index very large chunks of text, without storing the
      // entire content verbatim in the index. In this example we
      // can just wrap the content string in a StringReader.
      doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));

      return doc;
   }

   /**
    * Searches for the given string in the "content" field
    */
   private static void search(Searcher searcher, String queryString)
           throws ParseException, IOException {

      // Build a Query object
      QueryParser parser = new QueryParser(Version.LUCENE_30, 
              "content", 
              new StandardAnalyzer(Version.LUCENE_30));
      Query query = parser.parse(queryString);


      int hitsPerPage = 10;
      // Search for the query
      TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, false);
      searcher.search(query, collector);

      ScoreDoc[] hits = collector.topDocs().scoreDocs;

      int hitCount = collector.getTotalHits();
      System.out.println(hitCount + " total matching documents");

      // Examine the Hits object to see if there were any matches

      if (hitCount == 0) {
         System.out.println(
                 "No matches were found for \"" + queryString + "\"");
      } else {
         System.out.println("Hits for \"" +
                 queryString + "\" were found in quotes by:");

         // Iterate over the Documents in the Hits object
         for (int i = 0; i < hitCount; i++) {
            ScoreDoc scoreDoc = hits[i];
            int docId = scoreDoc.doc;
            float docScore = scoreDoc.score;
            System.out.println("docId: " + docId + "\t" + "docScore: " + docScore);

            Document doc = searcher.doc(docId);

            // Print the value that we stored in the "title" field. Note
            // that this Field was not indexed, but (unlike the
            // "contents" field) was stored verbatim and can be
            // retrieved.
            System.out.println("  " + (i + 1) + ". " + doc.get("title"));
            System.out.println("Content: " + doc.get("content"));            
         }
      }
      System.out.println();
   }
}

In progress: still trying to figure out how to get some version of Lucene working on App Engine for Java. My thoughts:

  • Use an In Memory index
  • Serialize to Memcache or the Datastore (not even sure how to do this right now)

Granted, there are limitations to this: if an App Engine application exceeds some memory limit, a SoftMemoryExceeded exception will be thrown. Also – I’m doubtful of the ability to update indexes incrementally in the datastore: not to mention, there’s a 1mb limit on datastore entries. The Blobstore, accessed programmatically, may not have the latency required. Still – it’s an interesting thought experiment, and there’s probably some compromise we can find with a future feature of App Engine that’ll allow us to make Lucene actually usable. We just have to think of it. Stay tuned. I’ll write another post if I can get even a proof-of-concept to work.
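
To make the memcache/datastore idea slightly more concrete, here’s a speculative sketch (emphatically untested; it assumes RAMDirectory is still Serializable in this version of Lucene, which is worth double-checking): flatten the in-memory index into a byte array that could then be stored in memcache or, if it fits under the 1mb limit, a datastore blob. java.io.ByteArrayOutputStream and java.io.ObjectOutputStream would need importing alongside the existing imports.

   private static byte[] serializeIndex(RAMDirectory idx) throws IOException {
      // Flatten the whole in-memory index into bytes for storage elsewhere
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      ObjectOutputStream out = new ObjectOutputStream(bytes);
      out.writeObject(idx);
      out.close();
      return bytes.toByteArray();
   }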

Written by Ikai Lan

April 24, 2010 at 8:32 am