Ikai Lan says

I say things!

Archive for June 10th, 2010

Using the bulkloader with Java App Engine


The latest release of the datastore bulkloader greatly simplifies import and export of data from App Engine applications. We’ll go through a step-by-step example of using this tool with a Java application. Note that only setting up Remote API is Java specific; everything else can be used with Python applications as well. Unlike certain phone companies, this is one store that doesn’t care what language your application is written in.

Checking for our Prerequisites:

If you already have Python 2.5.x and the Python SDK installed, skip this section.

First off, we’ll need to download the Python SDK. This example assumes we have Python version 2.5.x installed. If you’re not sure what version you have installed, open up a terminal and type “python”. This opens up a Python REPL, with the first line displaying the version of Python you’re using. Here’s example output:

Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

(Yes, Pythonistas, the version on my laptop is ooooooooold).

Download the Python SDK from the following link. As of the writing of this post, the newest version is 1.3.4: Direct link.

Unzip this file. It’ll be easier for you if you add the unzipped directory to your path. Linux and OS X users can append this to their ~/.bash_profile:

PATH="/path/to/where/you/unzipped/appengine:${PATH}"
export PATH

To test that everything is working, type

appcfg.py

You’ll see a page of usage commands that starts out like this:

Usage: appcfg.py [options] <action>

Action must be one of:
create_bulkloader_config: Create a bulkloader.yaml from a running application.
cron_info: Display information about cron jobs.
download_data: Download entities from datastore.
help: Print help for a specific action.
request_logs: Write request logs in Apache common log format.
rollback: Rollback an in-progress update.
update: Create or update an app version.
update_cron: Update application cron definitions.
update_dos: Update application dos definitions.
update_indexes: Update application indexes.
update_queues: Update application task queue definitions.
upload_data: Upload data records to datastore.
vacuum_indexes: Delete unused indexes from application.
Use 'help <action>' for a detailed description.

…. (and so forth)

Now we can go ahead and start using the bulkloader.

Using the bulkloader for Java applications

Before we can begin using the bulkloader, we’ll have to set it up. Setting up the bulkloader is a two step process. We’ll need to:

1. Add RemoteApi to our URI mapping
2. Generate a bulkloader configuration

Step 1: Add RemoteApi to our URI mapping

We’ll want to edit our web.xml. Add the following lines:

    <servlet>
        <servlet-name>RemoteApi</servlet-name>
        <servlet-class>com.google.apphosting.utils.remoteapi.RemoteApiServlet</servlet-class>
    </servlet>

    <servlet-mapping>
        <servlet-name>RemoteApi</servlet-name>
        <url-pattern>/remote_api</url-pattern>
    </servlet-mapping>

A common pitfall with setting up RemoteApi is that developers using frameworks will use a catch-all expression for mapping URIs, and this will stomp over our servlet-mapping. Once the mapping is in place, deploy the application into production. We’ll likely want to put an admin constraint on /remote_api.
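One way to spot a catch-all mapping before deploying is to scan web.xml for it. The following is only a sketch, written for Python 3 and run outside the SDK; the function name, the sample file and the FrameworkFilter entry are made up for illustration, and it assumes that "/*" and "/" are the patterns worth flagging:

```python
import xml.etree.ElementTree as ET

# Patterns treated as "catch-all" for this check (an assumption of the sketch).
CATCH_ALL_PATTERNS = {"/*", "/"}


def local_name(tag):
    """Strip an XML namespace prefix such as {http://java.sun.com/xml/ns/javaee}."""
    return tag.rsplit("}", 1)[-1]


def find_catch_all_mappings(web_xml_text):
    """Return (mapping type, servlet or filter name, pattern) for each catch-all."""
    root = ET.fromstring(web_xml_text)
    hits = []
    for elem in root.iter():
        kind = local_name(elem.tag)
        if kind not in ("servlet-mapping", "filter-mapping"):
            continue
        name = ""
        patterns = []
        for child in elem:
            child_tag = local_name(child.tag)
            if child_tag in ("servlet-name", "filter-name") and not name:
                name = (child.text or "").strip()
            elif child_tag == "url-pattern":
                patterns.append((child.text or "").strip())
        for pattern in patterns:
            if pattern in CATCH_ALL_PATTERNS:
                hits.append((kind, name, pattern))
    return hits


# Hypothetical web.xml with a framework filter mapped to everything.
SAMPLE_WEB_XML = """<web-app>
  <filter-mapping>
    <filter-name>FrameworkFilter</filter-name>
    <url-pattern>/*</url-pattern>
  </filter-mapping>
  <servlet-mapping>
    <servlet-name>RemoteApi</servlet-name>
    <url-pattern>/remote_api</url-pattern>
  </servlet-mapping>
</web-app>"""

for kind, name, pattern in find_catch_all_mappings(SAMPLE_WEB_XML):
    print("%s for %s uses %s; check that /remote_api still resolves" % (kind, name, pattern))
```

A hit doesn’t prove that /remote_api is broken. It just tells us which mappings to look at first if the bulkloader can’t connect.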

Step 2: Generate a bulkloader configuration

This step isn’t actually *required*, but it certainly makes our lives easier, especially if we are looking to export existing data. In a brand new application, if we are looking to bootstrap our application with data, we don’t need this step at all. For completeness, however, it’d be best to go over it.

We’ll need to generate a configuration template. This step depends on datastore statistics having been updated with the Entities we’re looking to export. Log in to appspot.com and click “Datastore Statistics” under Datastore in the right hand menu.

If we see something that looks like the following screenshot, we can use this tool.

If we see something that looks like the screenshot below, then we can’t autogenerate a configuration since this is a brand new application – that’s okay, that means we probably don’t have much data to export. We’ll have to wait for App Engine’s background tasks to bulk update our statistics before we’ll be able to complete this step.

Assuming that we have datastore statistics available, we can use appcfg.py in the following manner to generate a configuration file:

appcfg.py create_bulkloader_config --url=http://APPID.appspot.com/remote_api --application=APPID --filename=config.yml

If the datastore isn’t ready, running this command will cause the following error:

[ERROR   ] Unable to download kind stats for all-kinds download.
[ERROR   ] Kind stats are generated periodically by the appserver
[ERROR   ] Kind stats are not available on dev_appserver.

I’m using this on a Guestbook sample application I wrote for a codelab a while ago. The only Entities are Greetings, which consists of a String username, a String comment and a timestamp. This is what my config file looks like:

# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
#  * Fill in connector and connector_options
#  * Review the property_map.
#    - Ensure the 'external_name' matches the name of your CSV column,
#      XML tag, etc.
#    - Check that __key__ property is what you want. Its value will become
#      the key name on import, and on export the value will be the Key
#      object.  If you would like automatic key generation on import and
#      omitting the key on export, you can remove the entire __key__
#      property from the property map.

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Greeting
  connector:

  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      import_transform: transform.key_id_or_name_as_string

    - property: content
      external_name: content
      # Type: String Stats: 7 properties of this type in this kind.

    - property: date
      external_name: date
      # Type: Date/Time Stats: 7 properties of this type in this kind.
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')

    - property: name
      external_name: name
      # Type: String Stats: 7 properties of this type in this kind.

The part we care about is the connector. Replace the empty connector entry with the following:

- kind: Greeting
  connector: csv

We’ve only filled in the “connector” option. Now we have something we can use to dump data.

Examples of common usages of the bulkloader

Downloading data

We’ve got what we need to dump data. Let’s go ahead and do that now. Issue the following command:

appcfg.py download_data --config_file=config.yml --filename=data.csv --kind=Greeting --url=http://APPID.appspot.com/remote_api --application=APPID

We’ll be asked to provide our email and password credentials. Here’s what my console output looks like:

Downloading data records.
[INFO    ] Logging to bulkloader-log-20100609.162353
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 10
[INFO    ] Opening database: bulkloader-progress-20100609.162353.sql3
[INFO    ] Opening database: bulkloader-results-20100609.162353.sql3
[INFO    ] Connecting to java.latest.bootcamp-demo.appspot.com/remote_api
2010-06-09 16:23:57,022 WARNING appengine_rpc.py:399 ssl module not found.
Without the ssl module, the identity of the remote host cannot be verified, and
connections may NOT be secure. To fix this, please install the ssl module from
http://pypi.python.org/pypi/ssl .
To learn more, see http://code.google.com/appengine/kb/general.html#rpcssl .
Please enter login credentials for java.latest.bootcamp-demo.appspot.com
Email: YOUR EMAIL
Password for YOUR_EMAIL:
[INFO    ] Downloading kinds: ['Greeting']
.[INFO    ] Greeting: No descending index on __key__, performing serial download
.
[INFO    ] Have 17 entities, 0 previously transferred
[INFO    ] 17 entities (11304 bytes) transferred in 10.5 seconds

There’s now a CSV file named data.csv in my directory, as well as a bunch of autogenerated bulkloader-* files for resuming if the loader dies midway during the export. My CSV file starts like this:

content,date,name,key
Hey it works!,2010-05-18T22:35:17,Ikai Lan,1
… (More lines of CSV)

The first line is a header row – it designates the order in which properties have been exported. In our case, we’ve exported content, date and name in addition to Entity keys.
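If we want to work with the export in a script, the header row makes it easy to read each record by property name. Here’s a minimal sketch using Python’s csv module; it assumes Python 3, the function name is made up, and the date format is copied from the transforms in config.yml:

```python
import csv
import io
from datetime import datetime

# Same format string as the import/export transforms in config.yml.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def read_greetings(csv_file):
    """Read exported Greeting rows into dicts, parsing the date column."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["date"] = datetime.strptime(row["date"], DATE_FORMAT)
        rows.append(row)
    return rows


# In-memory stand-in for data.csv, using the first row shown above.
sample = io.StringIO(
    "content,date,name,key\n"
    "Hey it works!,2010-05-18T22:35:17,Ikai Lan,1\n"
)
greetings = read_greetings(sample)
print(greetings[0]["name"], greetings[0]["date"].isoformat())
# prints: Ikai Lan 2010-05-18T22:35:17
```

For the real file we’d pass in open("data.csv", newline="") instead of the StringIO object.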

Uploading Data

To upload the CSV file back into the datastore, we run the following command:

appcfg.py upload_data --config_file=config.yml --filename=data.csv --url=http://APPID.appspot.com/remote_api --application=APPID --kind=Greeting

This’ll use config.yml and create our entities in the remote datastore.
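The download and upload commands take the same flags, so if we run them often it may be worth generating them from one place. This is just a sketch: bulkloader_command is a made-up helper, it only builds the argument list from the flags used in this post, and it doesn’t run anything.

```python
def bulkloader_command(action, app_id, config_file, filename, kind):
    """Build the appcfg.py argument list for download_data or upload_data."""
    if action not in ("download_data", "upload_data"):
        raise ValueError("unsupported action: %s" % action)
    return [
        "appcfg.py",
        action,
        "--config_file=%s" % config_file,
        "--filename=%s" % filename,
        "--kind=%s" % kind,
        "--url=http://%s.appspot.com/remote_api" % app_id,
        "--application=%s" % app_id,
    ]


# APPID is a placeholder, as in the commands above.
for action in ("download_data", "upload_data"):
    print(" ".join(bulkloader_command(action, "APPID", "config.yml", "data.csv", "Greeting")))
```

The list can be handed to subprocess.call if we want to launch appcfg.py from the script, keeping in mind that it will still prompt for credentials.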

Adding a new field to datastore entities

One question that is frequently asked in the groups is, “How do I migrate my schema?” This question is generally poorly phrased; App Engine’s datastore is schemaless. That is – it is possible to have Entities of the same Kind with completely different sets of properties. Most of the time, this is a good thing. MySQL, for instance, requires a table lock to do a schema update. By being schema free, migrations can happen lazily, and application developers can check at runtime for whether a Property exists on a given Entity, then create or set the value as needed.

But there are times when this isn’t sufficient. One use case is if we want to change a default value on Entities and grandfather older Entities to the new default value, but we also want the default value to possibly be null. We can do tricks such as creating a new Property, setting an update timestamp, checking whether the update timestamp falls before or after our code change, updating conditionally, and so forth. The problem with this approach is that it introduces a TON of complexity into our application, and if we have more than one of these “migrations”, suddenly we’re writing more code to lazily grandfather data and confusing the non-Cylons who work on our team. It’s easier to migrate all the data. So how do we do this? Before the new application code goes live, we migrate the schema by adding the new field. The best part about this is that we can do this without locking tables, so writes can continue.

Let’s add a new String field to our Greeting class: homepageUrl. Let’s assume that we want to set a default to http://www.google.com. How would we do this? Let’s update our config.yml file to the following:

# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
#  * Fill in connector and connector_options
#  * Review the property_map.
#    - Ensure the 'external_name' matches the name of your CSV column,
#      XML tag, etc.
#    - Check that __key__ property is what you want. Its value will become
#      the key name on import, and on export the value will be the Key
#      object.  If you would like automatic key generation on import and
#      omitting the key on export, you can remove the entire __key__
#      property from the property map.

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Greeting
  connector: csv

  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      import_transform: transform.key_id_or_name_as_string

    - property: content
      external_name: content
      # Type: String Stats: 7 properties of this type in this kind.

    - property: homepageUrl
      external_name: homepageUrl

    - property: date
      external_name: date
      # Type: Date/Time Stats: 7 properties of this type in this kind.
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')

    - property: name
      external_name: name
      # Type: String Stats: 7 properties of this type in this kind.

Note that we’ve added a new property with a new external_name. By default, the loader will use a String.

Now let’s add the field to our CSV file:

content,date,name,key,homepageUrl
Hey it works!,2010-05-18T22:35:17,Ikai Lan,1,http://www.google.com
... (more lines)

We’d likely write a script to augment our CSV file. Note that this only works if we have named keys! If we had integer keys before, we’ll end up creating duplicate entities using key names and not integer IDs.
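A script for this can be quite short. Here’s a sketch using Python’s csv module; it assumes Python 3, and add_column is a made-up name. It appends the column to the header if it’s missing and fills in the default wherever a row has no value:

```python
import csv
import io


def add_column(src, dst, column, default):
    """Copy CSV rows from src to dst, adding `column` with `default` where empty."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or [])
    if column not in fieldnames:
        fieldnames.append(column)
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if not row.get(column):
            row[column] = default
        writer.writerow(row)


# In-memory stand-ins for the exported file and the augmented copy.
src = io.StringIO(
    "content,date,name,key\n"
    "Hey it works!,2010-05-18T22:35:17,Ikai Lan,1\n"
)
dst = io.StringIO()
add_column(src, dst, "homepageUrl", "http://www.google.com")
print(dst.getvalue())
```

With real files we’d open data.csv for reading and a second file for writing (both with newline=""), then pass the second file to upload_data. Writing to a separate file keeps the original export around in case something goes wrong.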

Now we run the bulkloader to upload our entities:

appcfg.py upload_data --config_file=config.yml --filename=data.csv --url=http://APPID.appspot.com/remote_api --application=APPID --kind=Greeting

Once our loader has finished running, we’ll see the new fields on our existing entities.

WARNING: There is a potential race condition here: if an Entity gets updated by our bulkloader in this fashion right as user facing code reads and updates the Entity without the new field, that will leave us with Entities that were grandfathered incorrectly. Fortunately, after we migrate, we can do a query for these Entities and manually update them. It’s slightly annoying, but far less painful than making bulkloader updates transactional.
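One way to find the stragglers is to run download_data again after the migration and scan the fresh export. This is a sketch rather than a datastore query, and it assumes that an Entity without the property is exported with an empty cell in that column, which is worth verifying on a small export first. The function name and the second sample row are made up for illustration:

```python
import csv
import io


def keys_missing_property(csv_file, column, key_column="key"):
    """Return the keys of exported rows whose `column` cell is empty."""
    reader = csv.DictReader(csv_file)
    return [
        row[key_column]
        for row in reader
        if not (row.get(column) or "").strip()
    ]


# In-memory stand-in for a re-export; the second row is hypothetical data.
export = io.StringIO(
    "content,date,name,key,homepageUrl\n"
    "Hey it works!,2010-05-18T22:35:17,Ikai Lan,1,http://www.google.com\n"
    "Second post,2010-05-19T08:00:00,Ikai Lan,2,\n"
)
print(keys_missing_property(export, "homepageUrl"))
# prints: ['2']
```

The keys it returns are the Entities to fix up, either by hand or with another small upload.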

Bootstrapping the datastore with default Entities

So we’ve covered the use case of using a generated config.yml file to update or load entities into the datastore, but what we haven’t yet covered is bootstrapping a completely new Entity Kind with never before seen data into the datastore.

Let’s add a new Entity Kind, Employee, to our datastore. We’ll preload this data, saved as new_entity.csv:

name,title
Ikai Lan,Developer Programs Engineer
Patrick Chanezon,Developer Advocate
Wesley Chun,Developer Programs Engineer
Nick Johnson,Developer Programs Engineer
Jason Cooper,Developer Programs Engineer
Christian Schalk,Developer Advocate
Fred Sauer,Developer Advocate

Note that we didn’t add a key. In this case, we don’t care, so it simplifies our config files. Now let’s take a look at the config file we need to use, saved here as new_entity.yml:

python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Employee
  connector: csv

  property_map:

    - property: name
      external_name: name

    - property: title
      external_name: title

Now let’s go ahead and upload these entities:

$ appcfg.py upload_data --config_file=new_entity.yml --filename=new_entity.csv  --url=http://APPID.appspot.com/remote_api --kind=Employee
Uploading data records.
[INFO    ] Logging to bulkloader-log-20100610.151326
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 10
[INFO    ] Opening database: bulkloader-progress-20100610.151326.sql3
[INFO    ] Connecting to APPID.appspot.com/remote_api
2010-06-10 15:13:27,334 WARNING appengine_rpc.py:399 ssl module not found.
Without the ssl module, the identity of the remote host cannot be verified, and
connections may NOT be secure. To fix this, please install the ssl module from
http://pypi.python.org/pypi/ssl .
To learn more, see http://code.google.com/appengine/kb/general.html#rpcssl .
Please enter login credentials for APPID.appspot.com
Email: your.email@gmail.com
Password for your.email@gmail.com:
[INFO    ] Starting import; maximum 10 entities per post
.
[INFO    ] 7 entites total, 0 previously transferred
[INFO    ] 7 entities (5394 bytes) transferred in 8.6 seconds
[INFO    ] All entities successfully transferred

Boom! We’re done.

There are still a lot of bulkloader topics to discuss – related entities, entity groups, keys, and so forth. Stay tuned.

Written by Ikai Lan

June 10, 2010 at 2:52 pm
