Archive for June 10th, 2010
Using the bulkloader with Java App Engine
Checking for our Prerequisites:
If you already have Python 2.5.x and the Python SDK installed, skip this section.
First off, we’ll need to download the Python SDK. This example assumes we have Python version 2.5.x installed. If you’re not sure what version you have installed, open up a terminal and type “python”. This opens up a Python REPL, with the first line displaying the version of Python you’re using. Here’s example output:
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
(Yes, Pythonistas, the version on my laptop is ooooooooold).
Download the Python SDK from the following link. As of the writing of this post, the newest version is 1.3.4: Direct link.
Unzip this file. It’ll be easier for you if you place this in your path. Linux and OS X users will append this in their ~/.bash_profile:
PATH="/path/to/where/you/unzipped/appengine:${PATH}"
export PATH
To test that everything is working, type
appcfg.py
You’ll see a page of usage commands that starts out like this:
Usage: appcfg.py [options] <action>
Action must be one of:
  create_bulkloader_config: Create a bulkloader.yaml from a running application.
  cron_info: Display information about cron jobs.
  download_data: Download entities from datastore.
  help: Print help for a specific action.
  request_logs: Write request logs in Apache common log format.
  rollback: Rollback an in-progress update.
  update: Create or update an app version.
  update_cron: Update application cron definitions.
  update_dos: Update application dos definitions.
  update_indexes: Update application indexes.
  update_queues: Update application task queue definitions.
  upload_data: Upload data records to datastore.
  vacuum_indexes: Delete unused indexes from application.
Use 'help <action>' for a detailed description.
…. (and so forth)
Now we can go ahead and start using the bulkloader.
Using the bulkloader for Java applications
Before we can begin using the bulkloader, we’ll have to set it up. Setting up the bulkloader is a two-step process. We’ll need to:
1. Add RemoteApi to our URI mapping
2. Generate a bulkloader configuration
Step 1: Add RemoteApi to our URI mapping
We’ll want to edit our web.xml. Add the following lines:
<servlet>
<servlet-name>RemoteApi</servlet-name>
<servlet-class>com.google.apphosting.utils.remoteapi.RemoteApiServlet</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>RemoteApi</servlet-name>
<url-pattern>/remote_api</url-pattern>
</servlet-mapping>
A common pitfall when setting up RemoteApi is that developers using frameworks map URIs with a catch-all expression, which stomps over our servlet-mapping. Once the mapping is in place, deploy the application into production. We’ll likely want to put an admin constraint on the /remote_api URL.
Step 2: Generate a bulkloader configuration
This step isn’t actually *required*, but it certainly makes our lives easier, especially if we are looking to export existing data. In a brand new application, if we are looking to bootstrap our application with data, we don’t need this step at all. For completeness, however, it’d be best to go over it.
We’ll need to generate a configuration template. This step depends on datastore statistics having been updated with the Entities we’re looking to export. Log in to appspot.com and click “Datastore Statistics” under Datastore in the right hand menu.
If we see something that looks like the following screenshot, we can use this tool.
If we see something that looks like the screenshot below, then we can’t autogenerate a configuration, since this is a brand new application. That’s okay; it means we probably don’t have much data to export. We’ll have to wait for App Engine’s background tasks to bulk update our statistics before we’ll be able to complete this step.
Assuming that we have datastore statistics available, we can use appcfg.py in the following manner to generate a configuration file:
appcfg.py create_bulkloader_config --url=http://APPID.appspot.com/remote_api --application=APPID --filename=config.yml
If the datastore isn’t ready, running this command will cause the following error:
[ERROR ] Unable to download kind stats for all-kinds download.
[ERROR ] Kind stats are generated periodically by the appserver
[ERROR ] Kind stats are not available on dev_appserver.
I’m using this on a Guestbook sample application I wrote for a codelab a while ago. The only Entities are Greetings, each of which consists of a String username, a String comment and a timestamp. This is what my config file looks like:
# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
# * Fill in connector and connector_options
# * Review the property_map.
# - Ensure the 'external_name' matches the name of your CSV column,
# XML tag, etc.
# - Check that __key__ property is what you want. Its value will become
# the key name on import, and on export the value will be the Key
# object. If you would like automatic key generation on import and
# omitting the key on export, you can remove the entire __key__
# property from the property map.
# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users
transformers:
- kind: Greeting
  connector:
  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      import_transform: transform.key_id_or_name_as_string
    - property: content
      external_name: content
      # Type: String Stats: 7 properties of this type in this kind.
    - property: date
      external_name: date
      # Type: Date/Time Stats: 7 properties of this type in this kind.
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')
    - property: name
      external_name: name
      # Type: String Stats: 7 properties of this type in this kind.
We care about the connector. Replace that with the following:
- kind: Greeting
  connector: csv
We’ve only filled in the “connector” option. Now we have something we can use to dump data.
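The format strings passed to import_date_time and export_date_time in the generated config look like ordinary Python strptime/strftime patterns. Assuming the transforms interpret them that way (worth confirming against the SDK source), a quick sketch like this shows what a matching timestamp looks like before we export anything. The helper names are my own, not part of the SDK:

```python
from datetime import datetime

# Format string copied from the date transforms in the generated config.
DATE_FORMAT = '%Y-%m-%dT%H:%M:%S'


def parse_exported_date(text):
    """Parse a timestamp string in the config's format into a datetime."""
    return datetime.strptime(text, DATE_FORMAT)


def format_for_export(value):
    """Render a datetime back into the config's string format."""
    return value.strftime(DATE_FORMAT)


# Round-trip one sample timestamp through both helpers.
parsed = parse_exported_date('2010-05-18T22:35:17')
round_tripped = format_for_export(parsed)
print(round_tripped)
```

If we ever change the pattern in config.yml, the CSV values have to change with it, since import and export use the same format string here.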
Examples of common usages of the bulkloader
Downloading data
We’ve got what we need to dump data. Let’s go ahead and do that now. Issue the following command:
appcfg.py download_data --config_file=config.yml --filename=data.csv --kind=Greeting --url=http://APPID.appspot.com/remote_api --application=APPID
We’ll be asked to provide our email and password credentials. Here’s what my console output looks like:
Downloading data records.
[INFO ] Logging to bulkloader-log-20100609.162353
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
[INFO ] Opening database: bulkloader-progress-20100609.162353.sql3
[INFO ] Opening database: bulkloader-results-20100609.162353.sql3
[INFO ] Connecting to java.latest.bootcamp-demo.appspot.com/remote_api
2010-06-09 16:23:57,022 WARNING appengine_rpc.py:399 ssl module not found.
Without the ssl module, the identity of the remote host cannot be verified, and
connections may NOT be secure. To fix this, please install the ssl module from
http://pypi.python.org/pypi/ssl .
To learn more, see http://code.google.com/appengine/kb/general.html#rpcssl .
Please enter login credentials for java.latest.bootcamp-demo.appspot.com
Email: YOUR EMAIL
Password for YOUR_EMAIL:
[INFO ] Downloading kinds: ['Greeting']
.[INFO ] Greeting: No descending index on __key__, performing serial download
.
[INFO ] Have 17 entities, 0 previously transferred
[INFO ] 17 entities (11304 bytes) transferred in 10.5 seconds
There’s now a CSV file named data.csv in my directory, as well as a bunch of autogenerated bulkloader-* files for resuming if the loader dies midway during the export. My CSV file starts like this:
content,date,name,key
Hey it works!,2010-05-18T22:35:17,Ikai Lan,1
... (More lines of CSV)
The first line is a labeling line – this line designates the order in which properties have been exported. In our case, we’ve exported content, date and name in addition to Entity keys.
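To sanity-check an export before doing anything else with it, we can read it back with Python’s csv module, which treats that first labeling line as the column names. This is only a sketch: it reads an inline sample copied from the export rather than opening data.csv, and read_rows is a helper name I made up.

```python
import csv
import io

# Sample copied from the export; a real script would open data.csv instead.
SAMPLE = (
    "content,date,name,key\n"
    "Hey it works!,2010-05-18T22:35:17,Ikai Lan,1\n"
)


def read_rows(handle):
    """Return (column names, list of row dicts) from an exported CSV."""
    reader = csv.DictReader(handle)
    # Accessing fieldnames consumes the header line; the rest are data rows.
    return reader.fieldnames, list(reader)


columns, rows = read_rows(io.StringIO(SAMPLE))
print(columns)
print(rows[0]['name'])
```

Every value comes back as a string, including the key and the date, so any type conversion is up to us.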
Uploading Data
To upload the CSV file back into the datastore, we run the following command:
appcfg.py upload_data --config_file=config.yml --filename=data.csv --url=http://APPID.appspot.com/remote_api --application=APPID --kind=Greeting
This’ll use config.yml and create our entities in the remote datastore.
Adding a new field to datastore entities
One question that is frequently asked in the groups is, “How do I migrate my schema?” This question is generally poorly phrased; App Engine’s datastore is schemaless. That is – it is possible to have Entities of the same Kind with completely different sets of properties. Most of the time, this is a good thing. MySQL, for instance, requires a table lock to do a schema update. By being schema free, migrations can happen lazily, and application developers can check at runtime for whether a Property exists on a given Entity, then create or set the value as needed.
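As a rough illustration of the lazy approach, here is a sketch that stands in plain Python dicts for entities (this is not the datastore API, and the "locale" property is purely hypothetical). The idea is just: check for the property at read time, and backfill it if it’s missing.

```python
def get_or_backfill(entity, prop, default):
    """Return entity[prop], setting it to default first if it is absent.

    `entity` is a plain dict standing in for a datastore entity; a real
    application would also write the modified entity back to the datastore.
    """
    if prop not in entity:
        entity[prop] = default
    return entity[prop]


# Two "entities" of the same kind with different sets of properties.
old_greeting = {'name': 'Ikai Lan', 'content': 'Hey it works!'}
new_greeting = {'name': 'Ikai Lan', 'content': 'Hi again', 'locale': 'fr'}

old_locale = get_or_backfill(old_greeting, 'locale', 'en')
new_locale = get_or_backfill(new_greeting, 'locale', 'en')
print(old_locale, new_locale)
```

The older entity picks up the default the first time it is touched, while the newer one keeps the value it already had.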
But there are times when this isn’t sufficient. One use case is if we want to change a default value on Entities and grandfather older Entities to the new default value, but we also want the default value to possibly be null. We can do tricks such as creating a new Property, setting an update timestamp, checking for whether the update timestamp is before or after when we made the code change and update conditionally, and so forth. The problem with this approach is that it introduces a TON of complexity into our application, and if we have more than one of these “migrations”, suddenly we’re writing more code to lazily grandfather data and confusing the non-Cylons who work on our team. It’s easier to migrate all the data. So how do we do this? Before the new application code goes live, we migrate the schema by adding the new field. The best part about this is that we can do this without locking tables, so writes can continue.
Let’s add a new String field to our Greeting class: homepageUrl. Let’s assume that we want to set a default to http://www.google.com. How would we do this? Let’s update our config.yml file to the following:
# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
# * Fill in connector and connector_options
# * Review the property_map.
# - Ensure the 'external_name' matches the name of your CSV column,
# XML tag, etc.
# - Check that __key__ property is what you want. Its value will become
# the key name on import, and on export the value will be the Key
# object. If you would like automatic key generation on import and
# omitting the key on export, you can remove the entire __key__
# property from the property map.
# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users
transformers:
- kind: Greeting
  connector: csv
  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      import_transform: transform.key_id_or_name_as_string
    - property: content
      external_name: content
      # Type: String Stats: 7 properties of this type in this kind.
    - property: homepageUrl
      external_name: homepageUrl
    - property: date
      external_name: date
      # Type: Date/Time Stats: 7 properties of this type in this kind.
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')
    - property: name
      external_name: name
      # Type: String Stats: 7 properties of this type in this kind.
Note that we’ve added a new property with a new external_name. By default, the loader will use a String.
Now let’s add the field to our CSV file:
content,date,name,key,homepageUrl
Hey it works!,2010-05-18T22:35:17,Ikai Lan,1,http://www.google.com
... (more lines)
We’d likely write a script to augment our CSV file. Note that this only works if we have named keys! If we had integer keys before, we’ll end up creating duplicate entities using key names and not integer IDs.
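Here’s a minimal sketch of what such a script could look like, using Python’s csv module to append a homepageUrl column with our default value. It works on in-memory strings so it can be run as-is; a real script would open data.csv and write a new file. The add_column helper is my own naming, not something the SDK provides.

```python
import csv
import io

DEFAULT_HOMEPAGE = 'http://www.google.com'


def add_column(src, dst, column, default):
    """Copy CSV rows from src to dst, appending `column` filled with `default`."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + [column]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    for row in reader:
        row[column] = default
        writer.writerow(row)


# Stand-ins for the exported file and the augmented file.
source = io.StringIO(
    "content,date,name,key\n"
    "Hey it works!,2010-05-18T22:35:17,Ikai Lan,1\n"
)
target = io.StringIO()
add_column(source, target, 'homepageUrl', DEFAULT_HOMEPAGE)
augmented = target.getvalue()
print(augmented)
```

The new column lands at the end of each row, matching the header order in the CSV above.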
Now we run the bulkloader to upload our entities:
appcfg.py upload_data --config_file=config.yml --filename=data.csv --url=http://APPID.appspot.com/remote_api --application=APPID --kind=Greeting
Once our loader has finished running, we’ll see the new fields on our existing entities.
WARNING: There is a potential race condition here: if an Entity gets updated by our bulkloader in this fashion right as user facing code reads and updates the Entity without the new field, that will leave us with Entities that were grandfathered incorrectly. Fortunately, after we migrate, we can do a query for these Entities and manually update them. It’s slightly annoying, but far less painful than making bulkloader updates transactional.
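One low-tech way to spot the stragglers is to download the data again and scan the export for rows where the new column is empty. The sketch below assumes that an Entity without the property exports as an empty cell, which I haven’t verified and which is worth checking on real data. The second sample row is made up for illustration, and keys_missing is a hypothetical helper name.

```python
import csv
import io


def keys_missing(handle, column):
    """Return the keys of rows whose `column` is absent or blank."""
    missing = []
    for row in csv.DictReader(handle):
        value = row.get(column) or ''
        if not value.strip():
            missing.append(row['key'])
    return missing


# First row mirrors the migrated sample; the second row is invented to
# represent an Entity that was overwritten without the new field.
fresh_export = io.StringIO(
    "content,date,name,key,homepageUrl\n"
    "Hey it works!,2010-05-18T22:35:17,Ikai Lan,1,http://www.google.com\n"
    "Second post,2010-05-19T08:00:00,Example User,2,\n"
)
stragglers = keys_missing(fresh_export, 'homepageUrl')
print(stragglers)
```

The resulting list of keys tells us which Entities to fix up by hand or with a second, smaller upload.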
Bootstrapping the datastore with default Entities
So we’ve covered the use case of using a generated config.yml file to update or load entities into the datastore, but what we haven’t yet covered is bootstrapping a completely new Entity Kind with never before seen data into the datastore.
Let’s add a new Entity Kind, Employee, to our datastore. We’ll preload this data:
name,title
Ikai Lan,Developer Programs Engineer
Patrick Chanezon,Developer Advocate
Wesley Chun,Developer Programs Engineer
Nick Johnson,Developer Programs Engineer
Jason Cooper,Developer Programs Engineer
Christian Schalk,Developer Advocate
Fred Sauer,Developer Advocate
Note that we didn’t add a key. In this case, we don’t care, so it simplifies our config files. I’ve saved the CSV above as new_entity.csv. Now let’s take a look at the config file we need to use, which I’ve saved as new_entity.yml:
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.api.datastore
- import: google.appengine.api.users
transformers:
- kind: Employee
  connector: csv
  property_map:
    - property: name
      external_name: name
    - property: title
      external_name: title
Now let’s go ahead and upload these entities:
$ appcfg.py upload_data --config_file=new_entity.yml --filename=new_entity.csv --url=http://APPID.appspot.com/remote_api --kind=Employee
Uploading data records.
[INFO ] Logging to bulkloader-log-20100610.151326
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
[INFO ] Opening database: bulkloader-progress-20100610.151326.sql3
[INFO ] Connecting to APPID.appspot.com/remote_api
2010-06-10 15:13:27,334 WARNING appengine_rpc.py:399 ssl module not found.
Without the ssl module, the identity of the remote host cannot be verified, and
connections may NOT be secure. To fix this, please install the ssl module from
http://pypi.python.org/pypi/ssl .
To learn more, see http://code.google.com/appengine/kb/general.html#rpcssl .
Please enter login credentials for APPID.appspot.com
Email: your.email@gmail.com
Password for your.email@gmail.com:
[INFO ] Starting import; maximum 10 entities per post
.
[INFO ] 7 entites total, 0 previously transferred
[INFO ] 7 entities (5394 bytes) transferred in 8.6 seconds
[INFO ] All entities successfully transferred
Boom! We’re done.
There are still a lot of bulkloader topics to discuss – related entities, entity groups, keys, and so forth. Stay tuned.




