Ikai Lan says

I say things!

Archive for June 2010

Using Asynchronous URLFetch on Java App Engine


Developers building applications on top of Java App Engine can use the familiar java.net interface for making off-network calls. For simple requests, this should be more than sufficient. The low-level API, however, provides one feature not available in java.net: asynchronous URLFetch.

The low-level URLFetch API

So what does the low-level API look like? Something like this:

import java.io.IOException;
import java.net.URL;
import java.util.List;

import com.google.appengine.api.urlfetch.HTTPHeader;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

	protected void makeSyncRequest() {
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		try {
			URL url = new URL("http://someurl.com");
			HTTPResponse response = fetcher.fetch(url);

			byte[] content = response.getContent();

			// if redirects are followed, this returns the final URL we were redirected to
			URL finalUrl = response.getFinalUrl();

			// 200, 404, 500, etc.
			int responseCode = response.getResponseCode();
			List<HTTPHeader> headers = response.getHeaders();

			for (HTTPHeader header : headers) {
				String headerName = header.getName();
				String headerValue = header.getValue();
			}

		} catch (IOException e) {
			// new URL throws MalformedURLException, which is impossible for us here
		}
	}

The full Javadocs are here.

So it’s a bit different from the standard java.net interface, where we’d get back a reader and iterate line by line over the response. We’re protected from a heap overflow because URLFetch is limited to 1 MB responses.

Asynchronous vs. Synchronous requests

Using java.net has the advantage of portability – you could build a standard fetcher that will work in any JVM environment, even those outside of App Engine. The tradeoff, of course, is that App Engine specific features won’t be present. The one killer feature of App Engine’s low-level API that isn’t present in java.net is asynchronous URLFetch. What is asynchronous fetch? Let’s make an analogy:

Let’s pretend you, like me at home, are on DSL and have a pretty pathetic downstream speed, and decide to check out a link sharing site like Digg. You browse to the front page and decide to open up every single link. You can do this synchronously or asynchronously.

Synchronously

Synchronously, you click link #1. Now you look at this page. When you are done looking at this page, you hit the back button and click link #2. Repeat until you have seen all the pages. Now, again, you are on DSL, so not only do you spend time reading each link, before you read each destination page, you have to wait for the page to load. This can take a significant portion of your time. The total amount of time you sit in front of your computer is thus:

N = number of links
L = loading time per page
R = reading time per page

N * (L + R)

(Yes, before I wrote this equation, I thought that including some mathematical formulas in my blog post would make me look smarter, but as it turns out the equation is something comprehensible by 8-year-olds internationally/American public high school students)

Asynchronously

Using a tabbed browser, you control-click every single link on the page to open them in new tabs. Now before you go look at any of the pages, you decide to go to the kitchen and make a grilled cheese sandwich. When you are done with the sandwich, you come back to your computer and sit down, enjoying your nice, toasty sandwich while you read articles about Ron Paul and look at funny pictures of cats. How much time are you spending?

N = number of links
S = loading time for the slowest loading page
R = reading time per page
G = time to make a grilled cheese sandwich
MAX((N * R + G), (N * R + S))

Which takes longer, your DSL, or the time it takes you to make a grilled cheese sandwich? The point that I’m making here is that you can save time by parallelizing things. No, I know it’s not a perfect analogy, as downloading N pages in parallel consumes the same crappy DSL connection, but you get what I am trying to say. Hopefully. And maybe you are also in the mood for some cheese.

Asynchronous URLFetch in App Engine

So what would the URLFetch above look like asynchronously? Probably something like this:

import java.io.IOException;
import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

import com.google.appengine.api.urlfetch.HTTPHeader;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

	protected void makeAsyncRequest() {
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		try {
			URL url = new URL("http://someurl.com");
			Future<HTTPResponse> future = fetcher.fetchAsync(url);

			// Other stuff can happen here!

			HTTPResponse response = future.get();
			byte[] content = response.getContent();
			URL finalUrl = response.getFinalUrl();
			int responseCode = response.getResponseCode();
			List<HTTPHeader> headers = response.getHeaders();

			for (HTTPHeader header : headers) {
				String headerName = header.getName();
				String headerValue = header.getValue();
			}

		} catch (IOException e) {
			// new URL throws MalformedURLException, which is impossible for us here
		} catch (InterruptedException e) {
			// Exception from using java.util.concurrent.Future
		} catch (ExecutionException e) {
			// Exception from using java.util.concurrent.Future
			e.printStackTrace();
		}

	}

This looks pretty similar – EXCEPT: fetchAsync doesn’t return an HTTPResponse. It returns a Future&lt;HTTPResponse&gt;. What is a Future?

java.util.concurrent.Future

From the Javadocs:

“A Future represents the result of an asynchronous computation. Methods are provided to check if the computation is complete, to wait for its completion, and to retrieve the result of the computation. The result can only be retrieved using method get when the computation has completed, blocking if necessary until it is ready. Cancellation is performed by the cancel method. Additional methods are provided to determine if the task completed normally or was cancelled.”


What does this mean in English? It means that the Future object is NOT the response of the HTTP call. We can’t get the response until we call the get() method. Between calling fetchAsync() and calling get(), we can do other stuff: datastore operations, inserting things into the Task Queue, heck, we can even kick off more URLFetch operations. When we finally DO call get(), one of two things happens:

  1. We’ve already retrieved the URL. Return an HTTPResponse object
  2. We’re still retrieving the URL. Block until we are done, then return an HTTPResponse object.

In the best case scenario, the time it takes to do the other things equals the time it takes to do the URLFetch, and the fetch costs us nothing extra. In the worst case scenario, we wait for whichever takes longer: the URLFetch or the other operations. Either way, we lower the total time needed to return a response to the end user.
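Nothing about this blocking behavior is App Engine specific, so we can see it with a plain java.util.concurrent executor. Here’s a toy sketch (the class name, the single-thread executor, and the 200ms timings are all invented for illustration): work done between submitting the task and calling get() overlaps with the “fetch”.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureOverlapDemo {
	public static void main(String[] args) throws Exception {
		ExecutorService pool = Executors.newSingleThreadExecutor();
		long start = System.currentTimeMillis();

		// Simulate a slow fetch that takes roughly 200ms
		Future<String> future = pool.submit(new Callable<String>() {
			public String call() throws Exception {
				Thread.sleep(200);
				return "response body";
			}
		});

		// Other stuff can happen here while the "fetch" runs
		Thread.sleep(200); // pretend this is datastore work

		// get() only blocks if the result isn't ready yet
		String response = future.get();
		long elapsed = System.currentTimeMillis() - start;

		System.out.println(response);
		// The two 200ms waits overlapped, so the total is well under 400ms
		System.out.println("overlapped: " + (elapsed < 380));
		pool.shutdown();
	}
}
```

If you replace the Thread.sleep with a second submitted task, the same overlap applies: the cost of N parallel waits approaches the cost of the slowest one.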

Twitter Example

So let’s build a servlet that retrieves my tweets. Just for giggles, let’s do it 20 times and see what the performance difference is. We’ll make it so that if we pass a URL parameter, async=true (or anything, for simplicity), we do the same operation using fetchAsync. The code is below:

package com.ikai.urlfetchdemo;

import java.io.IOException;
import java.io.PrintWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

@SuppressWarnings("serial")
public class GetTwitterFeedServlet extends HttpServlet {

	protected static String IKAI_TWITTER_RSS = "http://twitter.com/statuses/user_timeline/14437022.rss";

	public void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws IOException {

		boolean isSyncRequest = true;

		if(req.getParameter("async") != null) {
			isSyncRequest = false;
		}

		resp.setContentType("text/html");
		PrintWriter out = resp.getWriter();
		out.println("<h1>Twitter feed fetch demo</h1>");

		long startTime = System.currentTimeMillis();
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		URL url = new URL(IKAI_TWITTER_RSS);

		if(isSyncRequest) {
			out.println("<h2>Synchronous fetch</h2>");
			for(int i = 0; i < 20; i++) {
				HTTPResponse response = fetcher.fetch(url);
				printResponse(out, response);
			}
		} else {
			out.println("<h2>Asynchronous fetch</h2>");
			ArrayList<Future<HTTPResponse>> asyncResponses = new ArrayList<Future<HTTPResponse>>();
			for(int i = 0; i < 20; i++) {
				Future<HTTPResponse> responseFuture = fetcher.fetchAsync(url);
				asyncResponses.add(responseFuture);
			}

			for(Future<HTTPResponse> future : asyncResponses){
				HTTPResponse response;
				try {
					response = future.get();
					printResponse(out, response);
				} catch (InterruptedException e) {
					// Guess you would do something here
				} catch (ExecutionException e) {
					// Guess you would do something here
				}
			}

		}

		long totalProcessingTime = System.currentTimeMillis() - startTime;
		out.println("<p>Total processing time: " + totalProcessingTime + "ms</p>");
	}

	private void printResponse(PrintWriter out, HTTPResponse response) {
		out.println("<p>");
		out.println("Response: " + new String(response.getContent()));
		out.println("</p>");
	}

}

As you can see, it’s a bit more involved: we store all the Futures in a list, then iterate through them. We’re also not being too intelligent about the iteration order: we block on the futures first-in-first-out (FIFO), which may or may not match the order responses actually arrive in production. A more optimized version might consume responses from calls we know are fast before blocking on slower ones. Even so, empirical testing will show that, more often than not, doing things asynchronously is significantly faster for the user than doing them synchronously.
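One way to avoid blocking in strict FIFO order is to poll Future.isDone() and consume whichever response is ready first. Here’s an App Engine-free sketch of that pattern using a plain executor; the class name, delays, and thread pool are invented stand-ins for the URLFetch calls.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DrainFastestFirst {
	public static void main(String[] args) throws Exception {
		ExecutorService pool = Executors.newFixedThreadPool(3);
		List<Future<String>> futures = new ArrayList<Future<String>>();
		int[] delays = { 150, 50, 100 };
		for (final int delay : delays) {
			futures.add(pool.submit(new Callable<String>() {
				public String call() throws Exception {
					Thread.sleep(delay);
					return "slept " + delay;
				}
			}));
		}

		// Instead of blocking on the futures in submission order (FIFO),
		// poll isDone() and collect whichever result is ready first.
		List<String> results = new ArrayList<String>();
		while (!futures.isEmpty()) {
			Iterator<Future<String>> it = futures.iterator();
			while (it.hasNext()) {
				Future<String> f = it.next();
				if (f.isDone()) {
					results.add(f.get());
					it.remove();
				}
			}
			Thread.sleep(5); // avoid a busy spin
		}

		// The fastest task comes out first, though it was submitted second
		System.out.println(results.get(0));
		pool.shutdown();
	}
}
```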

Using Asynchronous URLFetch and HTTP POST

So far, our examples have been focused on read operations. What if we don’t care about the response? For instance, what if we decide to make an API call that essentially is a “write” operation, and can, for the most part, safely assume it will succeed?

// JavaAsyncUrlFetchDemoServlet.java
package com.ikai.urlfetchdemo;

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

@SuppressWarnings("serial")
public class JavaAsyncUrlFetchDemoServlet extends HttpServlet {

	public void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws IOException {

		long startTime = System.currentTimeMillis();
		URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
		fetcher.fetchAsync(FetchHelper.makeGuestbookPostRequest("Async", "At" + startTime));
		long totalProcessingTime = System.currentTimeMillis() - startTime;

		resp.setContentType("text/html");
		resp.getWriter().println("<h1>Asynchronous fetch demo</h1>");
		resp.getWriter().println("<p>Total processing time: " + totalProcessingTime + "ms</p>");
	}

}
// FetchHelper.java
package com.ikai.urlfetchdemo;

import java.net.MalformedURLException;
import java.net.URL;

import com.google.appengine.api.urlfetch.HTTPMethod;
import com.google.appengine.api.urlfetch.HTTPRequest;

public class FetchHelper {

	protected static final String signGuestBookUrl = "http://bootcamp-demo.appspot.com/sign";

	public static HTTPRequest makeGuestbookPostRequest(String name, String content){
		HTTPRequest request = null;
		URL url;
		try {
			url = new URL(signGuestBookUrl);
			request = new HTTPRequest(url, HTTPMethod.POST);
			String body = "name=" + name + "&content=" + content;
			request.setPayload(body.getBytes());

		} catch (MalformedURLException e) {
			// Do nothing
		}
		return request;
	}
}

I’ve decided to spam my own guestbook here, for better or for worse.
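One caveat: makeGuestbookPostRequest concatenates the raw name and content straight into the form body, so a space or an ampersand in either value would mangle the POST. A minimal sketch of proper form encoding with java.net.URLEncoder (the helper class and method names here are my own, not part of the demo code):

```java
import java.net.URLEncoder;

public class FormEncoder {
	// Build one application/x-www-form-urlencoded name=value pair
	public static String pair(String name, String value) throws Exception {
		return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
	}

	public static void main(String[] args) throws Exception {
		String body = pair("name", "Async") + "&" + pair("content", "cheese & crackers");
		System.out.println(body); // spaces become "+", "&" becomes "%26"
	}
}
```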

Download the code

You can download the code from this post here using git: http://github.com/ikai/Java-App-Engine-Async-URLFetch-Demo


Written by Ikai Lan

June 29, 2010 at 2:49 pm

Posted in Uncategorized

Using the bulkloader with Java App Engine


The latest release of the datastore bulkloader greatly simplifies import and export of data from App Engine applications for developers. We’ll go through a step by step example for using this tool with a Java application. Note that only setting up Remote API is Java specific – everything can be used with Python applications. Unlike certain phone companies, this is one store that doesn’t care what language your application is written in.

Checking for our Prerequisites:

If you already have Python 2.5.x and the Python SDK installed, skip this section.

First off, we’ll need to download the Python SDK. This example assumes we have Python version 2.5.x installed. If you’re not sure what version you have installed, open up a terminal and type “python”. This opens up a Python REPL, with the first line displaying the version of Python you’re using. Here’s example output:

Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

(Yes, Pythonistas, the version on my laptop is ooooooooold).

Download the Python SDK from the following link. As of the writing of this post, the newest version is 1.3.4: Direct link.

Unzip this file. It’ll be easier for you if you place this in your path. Linux and OS X users can append this to their ~/.bash_profile:

PATH="/path/to/where/you/unzipped/appengine:${PATH}"
export PATH

To test that everything is working, type

appcfg.py

You’ll see a page of usage commands that starts out like this:

Usage: appcfg.py [options] <action>

Action must be one of:
create_bulkloader_config: Create a bulkloader.yaml from a running application.
cron_info: Display information about cron jobs.
download_data: Download entities from datastore.
help: Print help for a specific action.
request_logs: Write request logs in Apache common log format.
rollback: Rollback an in-progress update.
update: Create or update an app version.
update_cron: Update application cron definitions.
update_dos: Update application dos definitions.
update_indexes: Update application indexes.
update_queues: Update application task queue definitions.
upload_data: Upload data records to datastore.
vacuum_indexes: Delete unused indexes from application.
Use 'help <action>' for a detailed description.

…. (and so forth)

Now we can go ahead and start using the bulkloader.

Using the bulkloader for Java applications

Before we can begin using the bulkloader, we’ll have to set it up first. Setting up the bulkloader is a two-step process. We’ll need to:

1. Add RemoteApi to our URI mapping
2. Generate a bulkloader configuration

Step 1: Add RemoteApi to our URI mapping

We’ll want to edit our web.xml. Add the following lines:

    <servlet>
        <servlet-name>RemoteApi</servlet-name>
        <servlet-class>com.google.apphosting.utils.remoteapi.RemoteApiServlet</servlet-class>
    </servlet>

    <servlet-mapping>
        <servlet-name>RemoteApi</servlet-name>
        <url-pattern>/remote_api</url-pattern>
    </servlet-mapping>

A common pitfall with setting up RemoteApi is that developers using frameworks will use a catch-all expression for mapping URIs, and this will stomp over our servlet-mapping. Deploy this application into production. We’ll likely want to put an admin constraint on this.
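One way to add that admin constraint, sketched from the standard servlet security-constraint mechanism that App Engine honors (the web-resource-name is arbitrary), is another web.xml stanza:

    <security-constraint>
        <web-resource-collection>
            <web-resource-name>remote_api</web-resource-name>
            <url-pattern>/remote_api</url-pattern>
        </web-resource-collection>
        <auth-constraint>
            <!-- On App Engine, "admin" restricts access to application administrators -->
            <role-name>admin</role-name>
        </auth-constraint>
    </security-constraint>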

Step 2: Generate a bulkloader configuration

This step isn’t actually *required*, but it certainly makes our lives easier, especially if we are looking to export existing data. In a brand new application, if we are looking to bootstrap our application with data, we don’t need this step at all. For completeness, however, it’d be best to go over it.

We’ll need to generate a configuration template. This step depends on datastore statistics having been updated with the Entities we’re looking to export. Log in to appspot.com and click “Datastore Statistics” under Datastore in the right hand menu.

If we see something that looks like the following screenshot, we can use this tool.

If we see something that looks like the screenshot below, then we can’t autogenerate a configuration since this is a brand new application – that’s okay, that means we probably don’t have much data to export. We’ll have to wait for App Engine’s background tasks to bulk update our statistics before we’ll be able to complete this step.

Assuming that we have datastore statistics available, we can use appcfg.py in the following manner to generate a configuration file:

appcfg.py create_bulkloader_config --url=http://APPID.appspot.com/remote_api --application=APPID --filename=config.yml

If the datastore isn’t ready, running this command will cause the following error:

[ERROR   ] Unable to download kind stats for all-kinds download.
[ERROR   ] Kind stats are generated periodically by the appserver
[ERROR   ] Kind stats are not available on dev_appserver.

I’m using this on a Guestbook sample application I wrote for a codelab a while ago. The only Entities are Greetings, which consists of a String username, a String comment and a timestamp. This is what my config file looks like:

# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
#  * Fill in connector and connector_options
#  * Review the property_map.
#    - Ensure the 'external_name' matches the name of your CSV column,
#      XML tag, etc.
#    - Check that __key__ property is what you want. Its value will become
#      the key name on import, and on export the value will be the Key
#      object.  If you would like automatic key generation on import and
#      omitting the key on export, you can remove the entire __key__
#      property from the property map.

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Greeting
  connector:

  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      import_transform: transform.key_id_or_name_as_string

    - property: content
      external_name: content
      # Type: String Stats: 7 properties of this type in this kind.

    - property: date
      external_name: date
      # Type: Date/Time Stats: 7 properties of this type in this kind.
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')

    - property: name
      external_name: name
      # Type: String Stats: 7 properties of this type in this kind.

We care about the connector. Replace that with the following:

- kind: Greeting
  connector: csv

We’ve only filled in the “connector” option. Now we have something we can use to dump data.

Examples of common usages of the bulkloader

Downloading data

We’ve got what we need to dump data. Let’s go ahead and do that now. Issue the following command:

appcfg.py download_data --config_file=config.yml --filename=data.csv --kind=Greeting --url=http://APPID.appspot.com/remote_api --application=APPID

We’ll be asked to provide our email and password credentials. Here’s what my console output looks like:

Downloading data records.
[INFO    ] Logging to bulkloader-log-20100609.162353
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 10
[INFO    ] Opening database: bulkloader-progress-20100609.162353.sql3
[INFO    ] Opening database: bulkloader-results-20100609.162353.sql3
[INFO    ] Connecting to java.latest.bootcamp-demo.appspot.com/remote_api
2010-06-09 16:23:57,022 WARNING appengine_rpc.py:399 ssl module not found.
Without the ssl module, the identity of the remote host cannot be verified, and
connections may NOT be secure. To fix this, please install the ssl module from
http://pypi.python.org/pypi/ssl .
To learn more, see http://code.google.com/appengine/kb/general.html#rpcssl .
Please enter login credentials for java.latest.bootcamp-demo.appspot.com
Email: YOUR EMAIL
Password for YOUR_EMAIL:
[INFO    ] Downloading kinds: ['Greeting']
.[INFO    ] Greeting: No descending index on __key__, performing serial download
.
[INFO    ] Have 17 entities, 0 previously transferred
[INFO    ] 17 entities (11304 bytes) transferred in 10.5 seconds

There’s now a CSV file named data.csv in my directory, as well as a bunch of autogenerated bulkloader-* files for resuming if the loader dies midway during the export. My CSV file starts like this:

content,date,name,key
Hey it works!,2010-05-18T22:35:17,Ikai Lan,1
… (More lines of CSV)

The first line is a labeling line – this line designates the order in which properties have been exported. In our case, we’ve exported content, date and name in addition to Entity keys.

Uploading Data

To upload the CSV file back into the datastore, we run the following command:

appcfg.py upload_data --config_file=config.yml --filename=data.csv --url=http://APPID.appspot.com/remote_api --application=APPID --kind=Greeting

This’ll use config.yml and create our entities in the remote datastore.

Adding a new field to datastore entities

One question that is frequently asked in the groups is, “How do I migrate my schema?” This question is generally poorly phrased; App Engine’s datastore is schemaless. That is – it is possible to have Entities of the same Kind with completely different sets of properties. Most of the time, this is a good thing. MySQL, for instance, requires a table lock to do a schema update. By being schema free, migrations can happen lazily, and application developers can check at runtime for whether a Property exists on a given Entity, then create or set the value as needed.

But there are times when this isn’t sufficient. One use case is if we want to change a default value on Entities and grandfather older Entities to the new default value, but we also want the default value to possibly be null. We can do tricks such as creating a new Property, setting an update timestamp, checking whether the update timestamp is before or after the code change, and updating conditionally, and so forth. The problem with this approach is that it introduces a TON of complexity into our application, and if we have more than one of these “migrations”, suddenly we’re writing more code to lazily grandfather data and confusing the non-Cylons that work on our team. It’s easier to migrate all the data. So how do we do this? Before the new application code goes live, we migrate the schema by adding the new field. The best part about this is that we can do it without locking tables, so writes can continue.

Let’s add a new String field to our Greeting class: homepageUrl. Let’s assume that we want to set a default to http://www.google.com. How would we do this? Let’s update our config.yml file to the following:

# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
#  * Fill in connector and connector_options
#  * Review the property_map.
#    - Ensure the 'external_name' matches the name of your CSV column,
#      XML tag, etc.
#    - Check that __key__ property is what you want. Its value will become
#      the key name on import, and on export the value will be the Key
#      object.  If you would like automatic key generation on import and
#      omitting the key on export, you can remove the entire __key__
#      property from the property map.

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Greeting
  connector: csv

  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      import_transform: transform.key_id_or_name_as_string

    - property: content
      external_name: content
      # Type: String Stats: 7 properties of this type in this kind.

    - property: homepageUrl
      external_name: homepageUrl

    - property: date
      external_name: date
      # Type: Date/Time Stats: 7 properties of this type in this kind.
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')

    - property: name
      external_name: name
      # Type: String Stats: 7 properties of this type in this kind.

Note that we’ve added a new property with a new external_name. By default, the loader will use a String.

Now let’s add the field to our CSV file:

content,date,name,key,homepageUrl
Hey it works!,2010-05-18T22:35:17,Ikai Lan,1,http://www.google.com
... (more lines)

We’d likely write a script to augment our CSV file. Note that this only works if we have named keys! If we had integer keys before, we’ll end up creating duplicate entities using key names and not integer IDs.
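Such a script could be as simple as appending one column to each line. A minimal sketch, assuming the CSV contains no quoted commas (true for this guestbook export); the class and method names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class AddCsvColumn {
	// Append a column: the header row gets the column name, every data
	// row gets the default value. Assumes no quoted commas in the input.
	public static List<String> addColumn(List<String> rows, String columnName, String defaultValue) {
		List<String> out = new ArrayList<String>();
		for (int i = 0; i < rows.size(); i++) {
			String suffix = (i == 0) ? columnName : defaultValue;
			out.add(rows.get(i) + "," + suffix);
		}
		return out;
	}

	public static void main(String[] args) {
		List<String> rows = new ArrayList<String>();
		rows.add("content,date,name,key");
		rows.add("Hey it works!,2010-05-18T22:35:17,Ikai Lan,1");
		for (String row : addColumn(rows, "homepageUrl", "http://www.google.com")) {
			System.out.println(row);
		}
	}
}
```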

Now we run the bulkloader to upload our entities:

appcfg.py upload_data --config_file=config.yml --filename=data.csv --url=http://APPID.appspot.com/remote_api --application=APPID --kind=Greeting

Once our loader has finished running, we’ll see the new fields on our existing entities.

WARNING: There is a potential race condition here: if an Entity gets updated by our bulkloader in this fashion right as user facing code reads and updates the Entity without the new field, that will leave us with Entities that were grandfathered incorrectly. Fortunately, after we migrate, we can do a query for these Entities and manually update them. It’s slightly annoying, but far less painful than making bulkloader updates transactional.

Bootstrapping the datastore with default Entities

So we’ve covered the use case of using a generated config.yml file to update or load entities into the datastore, but what we haven’t yet covered is bootstrapping a completely new Entity Kind with never before seen data into the datastore.

Let’s add a new Entity Kind, Employee, to our datastore. We’ll preload this data:

name,title
Ikai Lan,Developer Programs Engineer
Patrick Chanezon,Developer Advocate
Wesley Chun,Developer Programs Engineer
Nick Johnson,Developer Programs Engineer
Jason Cooper,Developer Programs Engineer
Christian Schalk,Developer Advocate
Fred Sauer,Developer Advocate

Note that we didn’t add a key. In this case, we don’t care, so it simplifies our config files. Now let’s take a look at the config.yml we need to use:

python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Employee
  connector: csv

  property_map:

    - property: name
      external_name: name

    - property: title
      external_name: title

Now let’s go ahead and upload these entities:

$ appcfg.py upload_data --config_file=new_entity.yml --filename=new_entity.csv  --url=http://APPID.appspot.com/remote_api --kind=Employee
Uploading data records.
[INFO    ] Logging to bulkloader-log-20100610.151326
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 10
[INFO    ] Opening database: bulkloader-progress-20100610.151326.sql3
[INFO    ] Connecting to APPID.appspot.com/remote_api
2010-06-10 15:13:27,334 WARNING appengine_rpc.py:399 ssl module not found.
Without the ssl module, the identity of the remote host cannot be verified, and
connections may NOT be secure. To fix this, please install the ssl module from
http://pypi.python.org/pypi/ssl .
To learn more, see http://code.google.com/appengine/kb/general.html#rpcssl .
Please enter login credentials for APPID.appspot.com
Email: your.email@gmail.com
Password for your.email@gmail.com:
[INFO    ] Starting import; maximum 10 entities per post
.
[INFO    ] 7 entities total, 0 previously transferred
[INFO    ] 7 entities (5394 bytes) transferred in 8.6 seconds
[INFO    ] All entities successfully transferred

Boom! We’re done.

There are still a lot of bulkloader topics to discuss – related entities, entity groups, keys, and so forth. Stay tuned.

Written by Ikai Lan

June 10, 2010 at 2:52 pm

Posted in Uncategorized

Introduction to working with App Engine’s low-level datastore API


App Engine’s Java SDK ships with three different mechanisms for persisting data:

  • JPA – the javax.persistence.* package
  • JDO – Java Data Objects
  • The low-level API

The formal documentation has some good examples for working with JDO and JPA, but the documentation for working with the low-level API is still a tad sparse. The original purpose of the low-level API was to give developers a way to build libraries that do persistence, or even to build persistence libraries themselves – alternative persistence mechanisms such as Objectify, Twig, SimpleDS and Slim3 are all built on top of this API.

For most developers, it may be simpler to use either JDO, JPA or a third-party library, but there are cases in which the low-level API is useful. This post will be a beginner’s guide to writing and retrieving data using this API – we’ll save more advanced topics for future posts.

For those newer to App Engine, let’s define a few terms before we continue:

Entity – An entity is an object representation of a datastore row. Unlike a row in a relational database, there are no predefined columns. There’s only one giant Bigtable, and your entities are all part of that table.

Entity Kind – There are no tables corresponding to types of data. The Kind of the entity is stored as part of the Entity Key.

Entity Key – The primary way by which entities are fetched – even when you issue queries, the datastore does a batch get by key of entities. It’s similar to a primary key in an RDBMS. The Key encodes your application ID, your Entity’s Kind, any parent entities and other metadata. Description of the key is out of scope of this article, but you’ll be able to find plenty of content about Keys when you refer to your favorite search engine.

Properties – Entities don’t have columns – they have properties. A property represents a field of your Entity. For instance, a Person Entity would have a Kind of Person, a Key built from a unique identifier such as their name (for all real-world scenarios, this is only true for me, as I’m the only Ikai Lan in the world), and Properties: age, height, weight, eye color, etc.

There are a lot more terms, but these are the ones we’ll be using frequently in this article. Let’s describe a few key features of the low-level API which differ from using a higher level tool such as the JDO and JPA interfaces. Depending on your point of view, these could be either advantages or disadvantages:

  • Typeless entities. Think of an Entity as a Java class with a Key property (datastore Key), a Kind property (String) and Properties (HashMap of Properties). This means that for a given entity kind, it is possible to have two different entities with completely different properties. You could have a Person entity that defines age and weight as its properties, and a separate Person entity that defines height and eye color.
  • No Managers. You instantiate a DatastoreService from a DatastoreServiceFactory, then you get(), put() and query(). No need to worry about opening or closing managers, detaching objects, marking items dirty, and so forth.
  • Lower startup time. For lower traffic Java websites, loading a PersistenceManagerFactory or EntityManagerFactory can incur additional startup time cost.

We’ll cover queries in a future post. In this post, we’ll just use get() and put(). In this article, we’ll treat App Engine’s datastore as if it were just a giant Map. Frankly, this isn’t a bad simplification – at its lowest level, Bigtable is a key-value store, which means that the Map abstraction isn’t too far from reality.
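Continuing that analogy, here’s a rough plain-Java sketch of the datastore-as-a-giant-Map mental model. The name TinyDatastore and the colon-joined string key are made up for illustration – this is not the real API, just the shape of it:

```java
import java.util.HashMap;
import java.util.Map;

// A toy key-value "datastore": keys are (kind, keyName) strings,
// values are property maps. Purely a mental model, not real App Engine code.
public class TinyDatastore {
    private final Map<String, Map<String, Object>> rows =
            new HashMap<String, Map<String, Object>>();

    // Mirrors DatastoreService.put(): store the properties under the key.
    public void put(String kind, String keyName, Map<String, Object> properties) {
        rows.put(kind + ":" + keyName, properties);
    }

    // Mirrors DatastoreService.get(): look the properties up by key.
    public Map<String, Object> get(String kind, String keyName) {
        return rows.get(kind + ":" + keyName);
    }

    public static void main(String[] args) {
        TinyDatastore datastore = new TinyDatastore();

        Map<String, Object> alice = new HashMap<String, Object>();
        alice.put("gender", "female");
        alice.put("age", 20L);
        datastore.put("Person", "Alice", alice);

        System.out.println(datastore.get("Person", "Alice").get("age")); // prints "20"
    }
}
```

Keep this picture in your head as we go – everything below is this, plus real Keys and real persistence.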

Let’s create two Entities representing Persons. We’ll name them Alice and Bob. Let’s define them now:

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

Entity alice = new Entity("Person", "Alice");
alice.setProperty("gender", "female");
alice.setProperty("age", 20);

Key bobKey = KeyFactory.createKey("Person", "Bob");
Entity bob = new Entity(bobKey);
bob.setProperty("gender", "male");
bob.setProperty("age", 23);

What we’ve demonstrated here are two of the basic ways to create entities. Entity contains five constructors. We’re just demonstrating two of them here.

We’re defining Alice with a raw constructor. We’re passing two Strings: her kind as well as her key name. As we mentioned before – Entities are typeless, and we can specify just about any String as her kind. Effectively, this means that the number of kinds we can have is limited only by the number of kinds that we need, and as long as we don’t lose track of them, we could have hundreds of different kinds without creating a class for each one. We could even define new kinds at runtime, if we so dared. The key name is what we’ll use to retrieve Alice later on when we need her again. Think of it as a Map or Dictionary key.

Once we have an Entity object, we need to define her properties. For now, we’ll define her gender and her age. Note that, again, Properties behave like Maps: not only can Entities have hundreds of different properties, we can also create new properties at runtime – at the expense of compile-time type safety. Choose your poison carefully.

We’re creating Bob’s instance a bit differently, but not too differently. Using KeyFactory’s static createKey method, we create a Key instance. Note the arguments – they are exactly the same: a kind and a key name. In our simple example this doesn’t give us any additional benefit beyond an extra line of code, but in more advanced usages in which we want to create an Entity with a parent, this technique results in clearer code. And again – we set Bob’s properties using something similar to a Map.

If you’ve been reading Entity’s Javadoc or following along in your IDE, you’ve probably realized by now that Entity does not contain setKey() or setKind() methods. This is because an Entity’s key is immutable. Once an Entity has a key, it can never be changed. You cannot retrieve an Entity from the datastore and change its key – you must create a new Entity with a new Key and delete the old Entity. This is also true of Entities instantiated in local memory.
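In the Map analogy, “re-keying” an Entity is the same copy-and-delete dance you’d do with a plain Map: put the value under the new key, then remove the old entry. A standalone plain-Java sketch of the pattern (with the real API you’d construct a new Entity with the new Key, put() it, then delete() the old Key):

```java
import java.util.HashMap;
import java.util.Map;

// Keys are immutable, so "changing" a key really means:
// copy the properties to a new entry, then delete the old one.
public class RekeyExample {
    public static void main(String[] args) {
        Map<String, Map<String, Object>> datastore =
                new HashMap<String, Map<String, Object>>();

        Map<String, Object> props = new HashMap<String, Object>();
        props.put("age", 20L);
        datastore.put("Person:Alice", props);

        // Copy under the new key...
        datastore.put("Person:Alicia",
                new HashMap<String, Object>(datastore.get("Person:Alice")));
        // ...then delete the old entry.
        datastore.remove("Person:Alice");

        System.out.println(datastore.containsKey("Person:Alicia")); // prints "true"
        System.out.println(datastore.containsKey("Person:Alice"));  // prints "false"
    }
}
```

Note that with the real datastore this is two operations, not one atomic rename – if that matters for your data, you’ll want to do both inside a transaction.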

Speaking of unsaved Entities, let’s go ahead and save them now. We’ll create an instance of the Datastore client and save Alice and Bob:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

Entity alice = new Entity("Person", "Alice");
alice.setProperty("gender", "female");
alice.setProperty("age", 20);

Key bobKey = KeyFactory.createKey("Person", "Bob");
Entity bob = new Entity(bobKey);
bob.setProperty("gender", "male");
bob.setProperty("age", 23);

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
datastore.put(alice);
datastore.put(bob);

That’s it! DatastoreService’s put() method also returns the Key of the saved Entity – handy if we let the datastore generate a numeric ID for us instead of specifying a key name ourselves.

Now let’s demonstrate retrieving Alice and Bob by Key from another class:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

Key bobKey = KeyFactory.createKey("Person", "Bob");
Key aliceKey = KeyFactory.createKey("Person", "Alice");

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Entity alice, bob;

try {
    alice = datastore.get(aliceKey);
    bob = datastore.get(bobKey);

    Long aliceAge = (Long) alice.getProperty("age");
    Long bobAge = (Long) bob.getProperty("age");
    System.out.println("Alice's age: " + aliceAge);
    System.out.println("Bob's age: " + bobAge);
} catch (EntityNotFoundException e) {
    // Alice or Bob doesn't exist!
}

The DatastoreService instance’s get() method takes a Key parameter; this is the same Key we used earlier to construct the Entity representing Bob! This method throws an EntityNotFoundException if no Entity with the given Key exists. We retrieve individual properties using the Entity class’s getProperty() method, which returns an Object – in the case of age, we cast this to a Long, since the datastore stores integral values as 64-bit longs.
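Because getProperty() returns an Object, the cast is our responsibility. If there’s any chance a property was written with a different type (say, an older version of the code saved age as a String), a defensive instanceof check beats a ClassCastException. Here’s a plain-Java sketch of that pattern – the asLong helper is made up for illustration, not part of the SDK:

```java
// Defensive reading of a loosely typed property value. A plain Object
// stands in here for Entity.getProperty()'s return value.
public class PropertyReadExample {
    // Coerce a property value to a Long, tolerating a few shapes it
    // might arrive in; returns null when no sensible conversion exists.
    static Long asLong(Object value) {
        if (value instanceof Long) {
            return (Long) value;
        }
        if (value instanceof Number) {
            return Long.valueOf(((Number) value).longValue());
        }
        if (value instanceof String) {
            try {
                return Long.valueOf((String) value);
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(asLong(23L));    // prints "23"
        System.out.println(asLong("23"));   // prints "23"
        System.out.println(asLong("oops")); // prints "null"
    }
}
```

Whether you want this leniency or would rather fail fast on a bad type is a judgment call – but make it deliberately.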

So there you have it: the basics of working with the low-level API. I’ll likely add more articles in the future about queries, transactions, and more advanced things you can do.

Written by Ikai Lan

June 3, 2010 at 6:46 pm