Getting started with jOOQ: A Tutorial
Introduction
I accidentally stumbled onto jOOQ a few days ago while doing a lot of research on Hibernate. Funny how things work, isn’t it? For those of you who aren’t familiar with it, jOOQ is a different approach to the over-ORMing of Java persistence. Rather than try to map database tables to Java classes and abstract away the SQL underneath, jOOQ assumes you want low-level control over the SQL queries you execute, and provides a mostly typesafe interface for executing queries. I don’t have anything against simple ORMs, but it’s good to have the right tool for the right job. From the jOOQ homepage:
Instead of this SQL query:
SELECT * FROM BOOK WHERE PUBLISHED_IN = 2011 ORDER BY TITLE
You would execute this Java code:
create.selectFrom(BOOK)
.where(PUBLISHED_IN.equal(2011))
.orderBy(TITLE)
Why a Java interface? Type safety, for one. Programmatically using jOOQ’s DSL has some advantages over writing SQL queries by hand, such as IDE support and compile time checking of some things.
The idea interested me and I dug in. Unfortunately, the jOOQ site’s documentation, while fairly comprehensive, DOES NOT PROVIDE AN END TO END “GETTING STARTED” PAGE!!! This means that if you want to learn jOOQ, you’ll have to jump to the chapter about Meta model code generation, then jump to the DSL, then jump to the jOOQ classes section. It’s a bit of a mess for new users. A Google search also didn’t turn up many useful results, so I figured I’d whip up a quick “Getting started” guide. We’re going to go over the following steps:
Preparation: Download jOOQ and your SQL driver
Step 1: Create a SQL database and a table
Step 2: Generate classes
Step 3: Write a main class and establish a MySQL connection
Step 4: Write a query using jOOQ’s DSL
Step 5: Iterate over results
Step 6: Profit!
Ready? Let’s get started.
Getting our hands dirty
Preparation: Download jOOQ and your SQL driver
If you haven’t already downloaded it, download jOOQ:
http://sourceforge.net/projects/jooq/files/
For this example, we’ll be using MySQL. If you haven’t already downloaded MySQL Connector/J, download it here:
http://dev.mysql.com/downloads/connector/j/
Stash these somewhere where you can get to them later.
Step 1: Create a SQL database and a table
We’re going to create a database called “guestbook” and a corresponding “posts” table. Connect to MySQL via your command line client and type the following:
CREATE DATABASE guestbook;
USE guestbook;
CREATE TABLE `posts` (
  `id` bigint(20) NOT NULL,
  `body` varchar(255) DEFAULT NULL,
  `timestamp` datetime DEFAULT NULL,
  `title` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
);
(I copied and pasted the create table statement from a “show create table” command)
Step 2: Generate classes
In this step, we’re going to use jOOQ’s command line tools to generate classes that map to the Posts table we just created. The official docs are here.
I’m going to augment the command line steps a bit. The easiest way to generate a schema is to copy the jOOQ jar files (there should be 3) and the MySQL Connector jar file to a temporary directory. Create a properties file. I’ve created a file called guestbook.properties that looks like this:
#Configure the database connection here
jdbc.Driver=com.mysql.jdbc.Driver
jdbc.URL=jdbc:mysql://localhost:3306/guestbook
jdbc.Schema=guestbook
jdbc.User=ikai
jdbc.Password=

#The default code generator. You can override this one, to generate your own code style
#Defaults to org.jooq.util.DefaultGenerator
generator=org.jooq.util.DefaultGenerator

#The database type. The format here is:
#generator.database=org.util.[database].[database]Database
generator.database=org.jooq.util.mysql.MySQLDatabase

#All elements that are generated from your schema (several Java regular expressions, separated by comma)
#Watch out for case-sensitivity. Depending on your database, this might be important!
generator.database.includes=.*

#All elements that are excluded from your schema (several Java regular expressions, separated by comma). Excludes match before includes
generator.database.excludes=

#Primary key / foreign key relations should be generated and used.
#This will be a prerequisite for various advanced features
#Defaults to false
generator.generate.relations=true

#Generate deprecated code for backwards compatibility
#Defaults to true
generator.generate.deprecated=false

#The destination package of your generated classes (within the destination directory)
generator.target.package=test.generated

#The destination directory of your generated classes
generator.target.directory=/Users/ikai/workspace/MySQLTest/src
One thing that wasn’t clear from jOOQ’s docs is the value of jdbc.Schema: it should be your database name. Since our database name is “guestbook”, that’s what we put. Replace the username with whatever user has the appropriate privileges: in my local dev database, my user has what is effectively root access to everything without a password. You’ll want to look at the other values and replace as necessary. Here are the two interesting properties:
generator.target.package – set this to the parent package you want to create for the generated classes. My setting of test.generated will cause test.generated.tables.Posts and test.generated.tables.records.PostsRecord to be created
generator.target.directory – the directory to output to. Worst case scenario you can just copy the files to the package.
Once you have the JAR files and guestbook.properties in your temp directory, type this:
java -classpath jooq-1.6.8.jar:jooq-meta-1.6.8.jar:jooq-codegen-1.6.8.jar:mysql-connector-java-5.1.18-bin.jar:. org.jooq.util.GenerationTool /jooq.properties
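Note the leading slash before jooq.properties. Even though the file is in our working directory, we need to prepend a slash. As far as I can tell, this is because the generator looks the file up as a classpath resource, which is also why the classpath above ends with the current directory (“.”). The command above assumes the properties file is named jooq.properties; if you named yours guestbook.properties, pass /guestbook.properties instead.
Replace the JAR filenames with your filenames. In this example, I’m using jOOQ 1.6.8. If everything has worked, you should see this in your console output: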
Nov 1, 2011 7:25:06 PM org.jooq.impl.JooqLogger info INFO: Initialising properties : /jooq.properties
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Database parameters
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: ----------------------------------------------------------
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: dialect : MYSQL
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: schema : guestbook
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: target dir : /Users/ikai/Documents/workspace/MySQLTest/src
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: target package : test.generated
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: ----------------------------------------------------------
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Emptying : /Users/ikai/workspace/MySQLTest/src/test/generated
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Generating classes in : /Users/ikai/workspace/MySQLTest/src/test/generated
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Generating schema : Guestbook.java
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Generating factory : GuestbookFactory.java
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Schema generated : Total: 122.18ms
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Sequences fetched : 0 (0 included, 0 excluded)
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Masterdata tables fetched: 0 (0 included, 0 excluded)
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Tables fetched : 5 (5 included, 0 excluded)
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Generating tables : /Users/ikai/workspace/MySQLTest/src/test/generated/tables
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: ARRAYs fetched : 0 (0 included, 0 excluded)
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Enums fetched : 0 (0 included, 0 excluded)
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: UDTs fetched : 0 (0 included, 0 excluded)
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Generating table : Posts.java
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Tables generated : Total: 680.464ms, +558.284ms
Nov 1, 2011 7:25:07 PM org.jooq.impl.JooqLogger info INFO: Generating Keys : /Users/ikai/workspace/MySQLTest/src/test/generated/tables
Nov 1, 2011 7:25:08 PM org.jooq.impl.JooqLogger info INFO: Keys generated : Total: 718.621ms, +38.157ms
Nov 1, 2011 7:25:08 PM org.jooq.impl.JooqLogger info INFO: Generating records : /Users/ikai/workspace/MySQLTest/src/test/generated/tables/records
Nov 1, 2011 7:25:08 PM org.jooq.impl.JooqLogger info INFO: Generating record : PostsRecord.java
Nov 1, 2011 7:25:08 PM org.jooq.impl.JooqLogger info INFO: Table records generated : Total: 782.545ms, +63.924ms
Nov 1, 2011 7:25:08 PM org.jooq.impl.JooqLogger info INFO: Routines fetched : 0 (0 included, 0 excluded)
Nov 1, 2011 7:25:08 PM org.jooq.impl.JooqLogger info INFO: Packages fetched : 0 (0 included, 0 excluded)
Nov 1, 2011 7:25:08 PM org.jooq.impl.JooqLogger info INFO: GENERATION FINISHED! : Total: 791.688ms, +9.143ms
Step 3: Write a main class and establish a MySQL connection
Let’s just write a vanilla main class in the project containing the generated classes:
import java.sql.Connection;
import java.sql.DriverManager;

public class Main {
    public static void main(String[] args) {
        Connection conn = null;
        String userName = "ikai";
        String password = "";
        String url = "jdbc:mysql://localhost:3306/guestbook";
        try {
            // Load the MySQL JDBC driver, then open and close a connection
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            conn = DriverManager.getConnection(url, userName, password);
            conn.close();
        } catch (Exception e) {
            // You'll probably want to handle the exceptions in a real app
            // Don't ever do this silence catch(Exception e) thing. I've seen this in
            // live code and it is horrendous.
            e.printStackTrace();
        }
    }
}
This is pretty standard code for establishing a MySQL connection.
Step 4: Write a query using jOOQ’s DSL
Let’s add a simple query:
GuestbookFactory create = new GuestbookFactory(conn);
Result<Record> result = create.select().from(Posts.POSTS).fetch();
We need to first get an instance of GuestbookFactory so we can write a simple SELECT query. We pass an instance of the MySQL connection to GuestbookFactory. Note that the factory doesn’t close the connection. We’ll have to do that ourselves.
We then use jOOQ’s DSL to return an instance of Result. We’ll be using this result in the next step.
Step 5: Iterate over results
After the line where we retrieve the results, let’s iterate over the results and print out the data:
for (Record r : result) {
    Long id = r.getValueAsLong(Posts.ID);
    String title = r.getValueAsString(Posts.TITLE);
    String description = r.getValueAsString(Posts.BODY);
    System.out.println("ID: " + id + " title: " + title + " description: " + description);
}
The full program should now look like this:
package test;

import java.sql.Connection;
import java.sql.DriverManager;

import org.jooq.Record;
import org.jooq.Result;

import test.generated.GuestbookFactory;
import test.generated.tables.Posts;

public class Main {
    /**
     * @param args
     */
    public static void main(String[] args) {
        Connection conn = null;
        String userName = "ikai";
        String password = "";
        String url = "jdbc:mysql://localhost:3306/guestbook";
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            conn = DriverManager.getConnection(url, userName, password);

            // The factory wraps the connection but does not close it for us
            GuestbookFactory create = new GuestbookFactory(conn);
            Result<Record> result = create.select().from(Posts.POSTS).fetch();

            for (Record r : result) {
                Long id = r.getValueAsLong(Posts.ID);
                String title = r.getValueAsString(Posts.TITLE);
                String description = r.getValueAsString(Posts.BODY);
                System.out.println("ID: " + id + " title: " + title + " description: " + description);
            }
            conn.close();
        } catch (Exception e) {
            // You'll probably want to handle the exceptions in a real app
            // Don't ever do this silence catch(Exception e) thing. I've seen this in
            // live code and it is horrendous.
            e.printStackTrace();
        }
    }
}
Step 6: Profit!
Get a job and go to work like the rest of us.
Conclusion
I haven’t explored the more advanced bits of jOOQ, but, at least judging from the docs, it looks like there’s a lot of meat there. I’m hoping this guide makes it easier for new users to dive in.
- ikai
Currently listening: Sweat – Snoop Dogg vs David Guetta
On Hackathons, Process, Email and the Tragedy of the Commons
Hackathons
I love hackathons. I love going to them, and I love running them. Most recently, I participated in a 48-hour hackathon in Kuala Lumpur, Malaysia. It’s one of the best parts of my job. I get to run (and sometimes participate in) both external hackathons and hackathons that are internal to Google.
In early June I held an internal Hackathon at Google to teach employees how to best use the product I work on: Google App Engine. I consider the event a success: we had hundreds of RSVPs and a completely booked room. It was so successful, in fact, that I’m planning on holding at least one of these events a quarter. The breakdown was primarily newer employees, which didn’t surprise me given the amount of hiring we’ve been doing.
A primary driver for the sheer volume of RSVPs was the fact that we advertised the event on a mailing list that went out to pretty much all of engineering. All. Of. It. At an engineering company with headcount in the tens of thousands, hundreds of RSVPs were not only likely, they were pretty much a mathematical certainty. Looking back, we would probably not have received the same response if we hadn’t sent out such a wide blast.
As a result of what I consider to be a fairly successful event (and I don’t mean to take all the credit here, at about the same time as my event, there was another very successful internal hackathon), various teams have suggested hackathons for their product APIs. There are events on the calendar.
Therein, of course, lies our problem. The problem of noise.
What should we do? Email all of engineering for every event? Create a new list/site/page announcing new events? Let’s break down the tradeoffs for each choice:
1. Email all of engineering
Pros: goes to everyone
Cons (and this is the bigger point of this post): the majority of events will be irrelevant to any given reader, so the signal-to-noise ratio on the list drops significantly and people start filtering out these announcements
2. Create a new distribution channel for events
Pros: Opt in
Cons: You don’t get the distribution you’d get with #1, since only a minority of people will opt in. Also – has the same SNR problems.
Now, a hybrid solution would be to do both. High profile, important events go to all of engineering, and smaller events go to the special distribution channel. The issue here is that everyone’s event is high profile. So again, we don’t have a great solution. Not to mention: people can only attend so many hackathons and still be able to do all the stuff they’re supposed to be doing. See, that’s one of the great things about Google engineering. If you’re consistently delivering, there isn’t a manager in the company that will tell you not to attend a hackathon or internal event where you can only get better at what you do. The issue, of course, is that the more hackathons take place, the more likely you are to be taking a resource away from another team for a non-trivial amount of time. From a hackathon organizer’s perspective, a hackathon is almost always beneficial as long as some non-zero number of participants show up: they learn about your API, provide feedback and you learn a bit about how to improve the documentation or SDK. You almost can’t afford not to throw a hackathon.
This is the classic example of the tragedy of the commons. By running an event, you consume space. You consume employee time. You generate noise on all the distribution channels. And when everyone does it, suddenly, as a whole, everyone is worse off, though you yourself may individually gain.
Another key example of the tragedy of the commons is a company’s email marketing. I worked at a consumer internet company that broke teams up by product. To drive usage metrics for an individual product, the product managers would run email campaigns to the site’s millions of users. The result was that the individual product would receive a bump in usage, and everyone would give themselves a pat on the back. What was actually happening was that it was causing users to become extremely irritated at the company (myself included) for the voluminous amounts of email being sent all the time. Sure, you could go to the site settings and disable email, but new products would automatically opt you in to receiving notifications, and you would have to log back into the site to find the settings and disable those notifications as well. Some users, like myself, have created Gmail filters to send all emails from this company’s domain to a “Stupid Mail” label. I can understand the individual product managers’ reasoning. You don’t want to be the one team that doesn’t deliver metrics, so you email spam. And when everyone email spams, it’s to the detriment of the company overall. An employee posted to an internal group asking if it was an example of the tragedy of the commons – I don’t know if his advice was ever heeded, but based on the complaints I see on Twitter about email, my guess is no.
Process
I view team processes the same way, and this sometimes leads to some very heated discussions with people I work with. It’s not that I don’t believe that turning your 1 step process into a 5 step process makes your life easier or the company better organized; it’s that everybody wants to turn their processes from one step, lightweight, free form processes into full on, form driven, strict-requirements-based, signed-in-triplicate steps for doing things. I fight heavy processes when I can because I don’t believe enough people do so. Why? The tragedy of the commons. An extra 20 minutes here, an extra 20 minutes there, and suddenly, I am spending most of my day tangled in process instead of getting things done.
There are no easy solutions to this, of course. Some process is necessary, though from the outset, it isn’t always obvious which parts. How do you know, for instance, if a process is unnecessary? A good example is a managerial approval step in a process. Let’s say I need approval to do something. How do I evaluate if managerial approval is working?
- What is the cost of doing it wrong? What was the bad outcome?
- How many incidents were there, prior to the institution of the process, in which that approval would have prevented a bad outcome?
- Is the manager rubber stamping requests?
What absolutely needs to be done are constant evaluations of process. Don’t create a process and sit on it. Make it better. What can you take away, and still have it work? Think about your last trip to the DMV. How many steps could have been eliminated?
Awareness of the bigger picture
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
Antoine de Saint-Exupery
French writer (1900 – 1944)
I suppose that’s the solution to fighting the tragedy of the commons. A constant awareness of the bigger picture and a real desire to make things better. An understanding that many things in this world are a zero sum game. I’ll issue caution, of course: you can probably only champion a few things. Champion fixing everything, and people stop listening to you, you lose focus, and you end up fixing very little. What do we call this effect? No, I won’t bother. Hopefully you actually read this and already know.
- Ikai
Setting up an OAuth provider on Google App Engine
App Engine provides an API for easily creating an OAuth provider. In this blog post, I’ll describe the following steps:
- Create and deploy an App Engine application that implements the OAuth API
- Add a new domain to your Google Account. Verify this domain.
- Connect an OAuth client to make requests against your application
I’ll avoid a deep explanation of OAuth for now. You can find everything you need to know about OAuth in the Beginner’s guide to OAuth.
Get the code
The code that goes along with this blog post is available here:
https://github.com/ikai/appengine-oauth-java-server-python-client-sample
The two most important files are:
- python/oauth_client.py
- src/com/ikai/oauthprovider/ProtectedServlet.java
Step 1: Create and deploy an App Engine application that uses the OAuth API
Create a new App Engine Java application. I’ve created a servlet called ProtectedServlet:
package com.ikai.oauthprovider;

import com.google.appengine.api.oauth.OAuthRequestException;
import com.google.appengine.api.oauth.OAuthService;
import com.google.appengine.api.oauth.OAuthServiceFactory;
import com.google.appengine.api.users.User;

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@SuppressWarnings("serial")
public class ProtectedServlet extends HttpServlet {
    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        User user = null;
        try {
            OAuthService oauth = OAuthServiceFactory.getOAuthService();
            user = oauth.getCurrentUser();
            resp.getWriter().println("Authenticated: " + user.getEmail());
        } catch (OAuthRequestException e) {
            resp.getWriter().println("Not authenticated: " + e.getMessage());
        }
    }
}
This servlet is incredibly simple. We retrieve an instance of OAuthService via OAuthServiceFactory and attempt to fetch the current user. Note that the User instance is the same kind of instance as a User returned by UserService. That’s because a User is still expected to sign in via a Google Account.
The method getCurrentUser() takes care of all of the OAuth signature verification. If something goes wrong – say, the request is not signed, or the signature is invalid, or the client’s timestamp is outside of the acceptable skew, or the nonce is repeated – OAuthService throws OAuthRequestException.
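To make those checks a little more concrete, here is a rough Python 3 sketch of what an OAuth 1.0 provider has to verify for an HMAC-SHA1 signed request. This is not App Engine’s implementation. The function names, the 300-second skew window and the in-memory nonce set are all assumptions made for illustration, and the URL normalization that the spec requires is left out.

```python
import hashlib
import hmac
import time
from base64 import b64encode
from urllib.parse import quote


def pct(value):
    # OAuth 1.0 percent-encoding: only unreserved characters stay as-is.
    return quote(str(value), safe="-._~")


def sign(method, url, params, consumer_secret, token_secret=""):
    # Signature base string: METHOD & encoded URL & encoded, sorted parameters.
    # `params` must not contain oauth_signature itself.
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join("%s=%s" % pair for pair in pairs)
    base = "&".join([method.upper(), pct(url), pct(normalized)])
    key = "%s&%s" % (pct(consumer_secret), pct(token_secret))
    digest = hmac.new(key.encode(), base.encode(), hashlib.sha1).digest()
    return b64encode(digest).decode()


MAX_SKEW_SECONDS = 300  # assumed window; the real value is up to the provider
seen_nonces = set()     # a real provider would use shared, expiring storage


def verify(method, url, params, signature, consumer_secret,
           token_secret="", now=None):
    now = time.time() if now is None else now
    if abs(now - int(params["oauth_timestamp"])) > MAX_SKEW_SECONDS:
        return "timestamp outside acceptable skew"
    nonce_key = (params["oauth_timestamp"], params["oauth_nonce"])
    if nonce_key in seen_nonces:
        return "nonce repeated"
    expected = sign(method, url, params, consumer_secret, token_secret)
    if not hmac.compare_digest(expected, signature):
        return "invalid signature"
    # Only remember the nonce once the request has proven to be genuine.
    seen_nonces.add(nonce_key)
    return "ok"
```

Each early return corresponds to one of the failure cases listed above. In the servlet, all of them surface as an OAuthRequestException.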
We can run this code locally, but it won’t work. When run locally, oauth.getCurrentUser() always returns a test user. We’ll need to deploy it to App Engine before it’ll do verification. After deploying, we can test the servlet. I have the servlet mapped to /resource. When we browse to this URL, we see:
Not authenticated: Unknown
That’s okay. We expect to see this because we’re sending a vanilla GET to this API.
Step 2: Add a new domain to your Google Account. Verify this domain
OAuth clients require a consumer key and consumer secret. We need to generate these. Browse to the “Manage Domains” page:
https://www.google.com/accounts/ManageDomains
Add the base URL of our App Engine app into the text box in the “Add a New Domain” section and click “Add domain”. For instance, I entered: http://ikai-oauth.appspot.com.
We’ll be taken to a new page where we need to verify ownership of the application:
Download the HTML verification file and place it into our war directory. Deploy this new version of the application to App Engine. Once we have confirmed that the page is serving, click “Verify” to complete the verification process.
When we have verified our domain, we will be asked to accept the Terms of Service and enter a few settings. Only the authsub setting is required; we can enter anything we want here because we will not be using authsub. We will then be presented with an OAuth consumer key and OAuth consumer secret. The OAuth consumer key is simply the domain, whereas the consumer secret is an autogenerated shared secret that clients will be using.
Now that we have these values, we can move on to step 3.
Step 3: Connect an OAuth client to make requests against your application
As of the time of this writing, App Engine only supports OAuth 1.0.
Below is a basic script that will do the 3-legged OAuth dance, cache access tokens locally and make API calls. To run this script, you will need to install the python-oauth2 library. If we have git installed, the commands to install the library on a *Nix like system are:
git clone https://github.com/simplegeo/python-oauth2.git
cd python-oauth2
sudo python setup.py install
This installs the oauth2 library into your Python install so you can import it when we need it.
Now we can run the script to make authenticated calls against our app. Note that we’ll want to substitute the consumer_secret and app_id values with values that map to your application ID and consumer secret:
import oauth2 as oauth
import urlparse
import os
import pickle

app_id = "your_app_id_here"
url = "http://%s.appspot.com/resource" % app_id

consumer_key = '%s.appspot.com' % app_id
consumer_secret = 'your_consumer_secret_here'

access_token_file = "token.dat"

request_token_url = "https://%s.appspot.com/_ah/OAuthGetRequestToken" % app_id
authorize_url = "https://%s.appspot.com/_ah/OAuthAuthorizeToken" % app_id
access_token_url = "https://%s.appspot.com/_ah/OAuthGetAccessToken" % app_id

consumer = oauth.Consumer(consumer_key, consumer_secret)

if not os.path.exists(access_token_file):
    client = oauth.Client(consumer)

    # Step 1: Get a request token. This is a temporary token that is used for
    # having the user authorize an access token and to sign the request to obtain
    # said access token.
    resp, content = client.request(request_token_url, "GET")
    if resp['status'] != '200':
        raise Exception("Invalid response %s." % resp['status'])

    request_token = dict(urlparse.parse_qsl(content))

    print "Request Token:"
    print " - oauth_token = %s" % request_token['oauth_token']
    print " - oauth_token_secret = %s" % request_token['oauth_token_secret']
    print

    print "Go to the following link in your browser:"
    print "%s?oauth_token=%s" % (authorize_url, request_token['oauth_token'])
    print

    # After the user has granted access to you, the consumer, the provider will
    # redirect you to whatever URL you have told them to redirect to. You can
    # usually define this in the oauth_callback argument as well.
    accepted = 'n'
    while accepted.lower() == 'n':
        accepted = raw_input('Have you authorized me? (y/n) ')

    # Step 3: Once the consumer has redirected the user back to the oauth_callback
    # URL you can request the access token the user has approved. You use the
    # request token to sign this request. After this is done you throw away the
    # request token and use the access token returned. You should store this
    # access token somewhere safe, like a database, for future use.
    token = oauth.Token(request_token['oauth_token'],
                        request_token['oauth_token_secret'])
    client = oauth.Client(consumer, token)

    resp, content = client.request(access_token_url, "POST")
    access_token = dict(urlparse.parse_qsl(content))

    print "Access Token:"
    print " - oauth_token = %s" % access_token['oauth_token']
    print " - oauth_token_secret = %s" % access_token['oauth_token_secret']
    print
    print "You may now access protected resources using the access tokens above."
    print

    token = oauth.Token(access_token['oauth_token'],
                        access_token['oauth_token_secret'])

    with open(access_token_file, "w") as f:
        pickle.dump(token, f)
else:
    with open(access_token_file, "r") as f:
        token = pickle.load(f)

client = oauth.Client(consumer, token)
resp, content = client.request(url, "GET")
print "Response Status Code: %s" % resp['status']
print "Response body: %s" % content
(The basis for this script was shamelessly stolen from Joe Stump’s sample oauth2 code for his Python library on Github.)
Once we run the script using:
python oauth_client.py
we should see:
Request Token:
 - oauth_token = SOME_OAUTH_REQUEST_TOKEN_VALUE
 - oauth_token_secret = SOME_OAUTH_REQUEST_SECRET_VALUE

Go to the following link in your browser:
https://YOUR-APP-ID.appspot.com/_ah/OAuthAuthorizeToken?oauth_token=SOME_OAUTH_REQUEST_TOKEN_VALUE

Have you authorized me? (y/n)
The OAuth token and token secret values shown here are issued by the provider in response to the script’s signed request; the script itself contributes a random nonce, a timestamp and a signature computed with the consumer key/secret pair. With these values, known as request tokens, we generate an authorization URL for an end user to bless our client so it can make OAuth requests on behalf of the user that grants authorization.
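The request-token response arrives as a form-encoded body, which the script parses into a dictionary before building the authorization URL. Here is a small Python 3 sketch of just that step. The helper name and the token values are made up for illustration, and unlike the script it URL-encodes the token before appending it.

```python
from urllib.parse import parse_qsl, urlencode


def authorization_url(authorize_url, response_body):
    # The provider's response is form-encoded, e.g.
    # "oauth_token=...&oauth_token_secret=...".
    request_token = dict(parse_qsl(response_body))
    query = urlencode({"oauth_token": request_token["oauth_token"]})
    return request_token, "%s?%s" % (authorize_url, query)


# Made-up values standing in for a real provider response.
body = "oauth_token=REQUEST_TOKEN&oauth_token_secret=REQUEST_SECRET"
token, link = authorization_url(
    "https://your-app-id.appspot.com/_ah/OAuthAuthorizeToken", body)
print(link)
# https://your-app-id.appspot.com/_ah/OAuthAuthorizeToken?oauth_token=REQUEST_TOKEN
```

The token secret never appears in the link. It stays with the client and is only used to sign the later access-token request.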
At this point, the script pauses for input. As part of the OAuth dance, we need to browse to the URL provided and authorize the script. Copy/paste this URL into your browser window and click “Grant Access”.
Once we see a page that says:
You have successfully granted ikai-oauth.appspot.com access to your Google Account. You can revoke access at any time under ‘My Account’.
We can switch back to your terminal window and hit “y”. The client now exchanges our request tokens for access tokens. Access tokens are what you need to make API calls. The script outputs this:
Access Token:
 - oauth_token = SOME_OAUTH_ACCESS_TOKEN
 - oauth_token_secret = SOME_OAUTH_ACCESS_TOKEN_SECRET

You may now access protected resources using the access tokens above.

Response Status Code: 200
Response body: Authenticated: the-account-you-logged-in-with@gmail.com
The Python script caches the access token in a file called token.dat, so the next time we run oauth_client.py, we skip the authorization dance and can directly make API calls:
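The caching itself is a plain load-or-create pattern. Below is a minimal Python 3 sketch of it, separate from the script above. It stores the token as JSON rather than pickle (my substitution, so the file is readable), and fake_authorize is a hypothetical stand-in for the interactive authorization dance.

```python
import json
import os
import tempfile


def load_or_create_token(path, authorize):
    # Reuse a cached token if one exists; otherwise run the slow,
    # interactive authorization step once and remember its result.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    token = authorize()
    with open(path, "w") as f:
        json.dump(token, f)
    return token


calls = []


def fake_authorize():
    # Hypothetical stand-in for the three-legged dance in oauth_client.py.
    calls.append(1)
    return {"oauth_token": "ACCESS_TOKEN",
            "oauth_token_secret": "ACCESS_SECRET"}


cache_path = os.path.join(tempfile.mkdtemp(), "token.json")
first = load_or_create_token(cache_path, fake_authorize)
second = load_or_create_token(cache_path, fake_authorize)
print(first == second, len(calls))  # True 1
```

Keep in mind that the cached file is a credential: anyone who can read it can act as the user until access is revoked, so it should not be checked in or left world-readable.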
$ python oauth_client.py
Response Status Code: 200
Response body: Authenticated: the-account-you-logged-in-with@gmail.com
That’s all there is to it!
Final notes and general tips
Setting up an OAuth provider using App Engine’s API is incredibly simple once we know all the steps. Setting up the provider is just a matter of a few lines of code, and the steps to set up the client are pretty straightforward. The most difficult part is setting up the consumer key and secret, but even that isn’t so bad once we know where the management interface is.
When possible, use OAuth instead of ClientLogin. This goes for web applications, mobile applications, desktop apps, and even command line scripts. OAuth allows users to revoke your access token and trains users not to arbitrarily give out their Google Account password to any interface that asks for it. For building clients, it also gives you a way to do client authentication without having to cache credentials – using ClientLogin too often results in CaptchaRequiredException being thrown, anyway.
- Ikai
References
Github sample code:
https://github.com/ikai/appengine-oauth-java-server-python-client-sample
App Engine/Java OAuth docs: http://code.google.com/appengine/docs/java/oauth/overview.html
Domain management – get your consumer key/secret here: https://www.google.com/accounts/ManageDomains
Python OAuth client code: https://github.com/simplegeo/python-oauth2
Unit Testing in Tipfy, an App Engine framework in Python
I’ve been playing around with the Tipfy framework for App Engine. Tipfy is a framework built on top of App Engine’s APIs that provides many features on top of what is currently possible. I won’t go too much into their virtues here.
One thing that’s bothered me is the dearth of a testing guide. More disturbing still is that one of the top search results for unit testing is a groups post of a developer bragging that he doesn’t write tests (let’s hope no one ever has to work with you). Digging around, it’s clear that Rodrigo Moraes, the creator of Tipfy, emphasizes testing in his own app, as evidenced by the testing package in the Tipfy source repository. I’ve decided to write this quick guide to save other developers some of the detective work I’ve had to do to get unit tests running.
Shortcut
So – if you don’t want to read, you can just skip ahead and read this code sample which shows an example of how to write tests for the demo “Hello, World” application that comes as part of the Tipfy download.
Getting Started
We’re going to need a few different tools to run tests. Note that we don’t need need them, I just find that using these tools will make our life a lot easier:
- Nose – Nose is a popular Python test discovery and execution tool. Nose will dig through your source directory and run your tests
- Nose GAE plugin – this is the plugin that makes nose play nice with the local App Engine SDK
If you don’t already have these tools installed, go ahead and install them with easy_install:
sudo easy_install nose
sudo easy_install nosegae
We’ll also need to make sure tipfy is on our PYTHONPATH. Look for tipfy under YOUR_TIPFY_INSTALL/app/distlib. Here’s what I see as of the writing of this post:
distlib ikai$ ls
README.txt babel jinja2 tipfy werkzeug
Add this to your PYTHONPATH by adding a line to ~/.bash_profile (or equivalent on your system):
export PYTHONPATH="/path/to/root/of/tipfy/libraries"
If needed, run:
source ~/.bash_profile
Alright, you’re ready to roll. Run a test from the root of your application directory. It’s probably easiest to do this from the directory app.yaml resides in:
nosetests -d --with-gae --without-sandbox -v
Note that this assumes your App Engine SDK lives at /usr/local/google_appengine. If it doesn’t, either symlink it or pass the --gae-lib-root flag.
You only really need the --with-gae and --without-sandbox flags, but I like the other flags. Type nosetests --help for a full description of the options available.
Now let’s write some tests.
Writing tests
Now let’s create a new file for tests. Tipfy has a concept of apps within a project (think Django apps), so for this example, I’ll create a file called tests.py in each app directory to keep things organized (we’ll have to remember to create a setting in app.yaml to not upload this file, but this isn’t crucial). The responsibility of the tests in this file will be to run the tests for the app it’s colocated with. It’d be equally valid to create a test directory.
Here’s our tests.py:
import unittest
from tipfy import RequestHandler, Tipfy
import urls

class TestHandler(unittest.TestCase):
    def setUp(self):
        self.app = Tipfy(rules=urls.get_rules(None))
        self.client = self.app.get_test_client()

    def test_hello_world_handler(self):
        response = self.client.get('/', follow_redirects=True)
        self.assertEquals(response.data, "Hello BLAH")

    def test_pretty_hello_world_handler(self):
        response = self.client.get('/pretty')
        self.assertTrue("Hello, World!" in response.data)
Let’s talk through what we’re doing here step by step:
def setUp(self):
    self.app = Tipfy(rules=urls.get_rules(None))
    self.client = self.app.get_test_client()
If you’re used to Python testing, this shouldn’t look too surprising to you. The setUp function is run before each test. We’re doing two things here:
- Initialize an instance of the app. We’ve imported the urls module from this app, so we can call get_rules() on it to get our URL mappings. We’re passing None to this because it expects an app, but as luck would have it, the “Hello World” demo doesn’t actually use this parameter.
- We’re initializing an instance of the test client. This is what we’ll be using to make requests
Now let’s talk about the tests:
def test_hello_world_handler(self):
    response = self.client.get('/', follow_redirects=True)
    self.assertEquals(response.data, "Hello BLAH")

def test_pretty_hello_world_handler(self):
    response = self.client.get('/pretty')
    self.assertTrue("Hello, World!" in response.data)
In test_hello_world_handler(), we use self.client.get() to make a call to the “/” URL. Note that we’ve passed a follow_redirects argument; we don’t actually need this. This is just something I copied over from Rodrigo’s original testing example. We then assert that the response body equals the expected output.
In our second test, we test the “pretty” version of this handler. We look for a String inside, but really it’s up to us how we want to do this. In general, we don’t want to look for an exact match of the output, since this makes our test extremely brittle and we’ll end up either not maintaining or deleting this test.
Advanced users will likely have all the handlers extend a BaseHandler RequestHandler class and call self.render(). We can point the render method to a Mock method, then try to capture the context parameters that were passed. (This is a bit out of scope for this post, but I may follow up with some quick samples of how to do mocking – I like Michael Foord’s Mock library.)
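To make the mocking idea a bit more concrete, here is a minimal sketch. It uses unittest.mock, which is the form in which Michael Foord’s Mock library ships with Python 3. BaseHandler, HelloHandler, the render() signature and the template name are all hypothetical stand-ins I made up for illustration; a real BaseHandler would extend Tipfy’s RequestHandler.

```python
from unittest import mock


class BaseHandler(object):
    """Hypothetical base class; a real one would extend tipfy's RequestHandler."""

    def render(self, template, **context):
        raise NotImplementedError("real template rendering would go here")


class HelloHandler(BaseHandler):
    """Hypothetical handler under test."""

    def get(self):
        return self.render("hello.html", greeting="Hello, World!")


def capture_render_call(handler):
    """Call handler.get() with render() mocked out; return (template, context)."""
    with mock.patch.object(handler, "render", return_value="rendered") as fake_render:
        handler.get()
    # call_args holds the positional and keyword arguments of the last call
    args, kwargs = fake_render.call_args
    return args[0], kwargs


template, context = capture_render_call(HelloHandler())
print(template, context)
```

The point of this approach is that the test asserts on the context the handler built rather than on the rendered HTML, which keeps it from breaking every time the template changes.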
Writing tests with the datastore
Let’s do something a bit more interesting. Let’s run some tests with the datastore. We’ll also demonstrate some other ways of testing Tipfy. Let’s consider the following, updated code snippet:
# Install nose and nosegae:
# sudo easy_install nose
# sudo easy_install nosegae
#
# run via:
# nosetests --with-gae --without-sandbox -v
import unittest
from tipfy import RequestHandler, Rule, Tipfy
# Need this import for testing
from google.appengine.api import apiproxy_stub_map, datastore_file_stub
from google.appengine.ext import db
import urls

class Comment(db.Model):
    body = db.StringProperty()

class TestHandler(unittest.TestCase):
    def setUp(self):
        """
        We use this to clear the datastore. Thanks to Gaetestbed for
        his example here:
        https://github.com/jgeewax/gaetestbed/blob/master/gaetestbed/datastore.py
        """
        datastore_stub = apiproxy_stub_map.apiproxy._APIProxyStubMap__stub_map['datastore_v3']
        datastore_stub.Clear()
        # We're importing rules from the sample app
        # The sample app doesn't require an app
        self.app = Tipfy(rules=urls.get_rules(None))
        self.client = self.app.get_test_client()

    def test_hello_world_handler(self):
        response = self.client.get('/', follow_redirects=True)
        self.assertEquals(response.data, "Hello BLAH")

    def test_pretty_hello_world_handler(self):
        response = self.client.get('/pretty')
        self.assertTrue("Hello, World!" in response.data)

    def test_save_comment(self):
        class DatastorePostHandler(RequestHandler):
            def post(self):
                body = self.request.form.get("body")
                comment = Comment()
                comment.body = body
                comment.save()
                return "OK"

        rules = [
            Rule('/ds', endpoint='ds', handler=DatastorePostHandler),
        ]
        app = Tipfy(rules=rules)
        client = app.get_test_client()
        response = client.post('/ds')
        self.assertEquals(response.data, "OK")
        comments = Comment.all().fetch(100)
        self.assertEquals(1, len(comments))
Revisiting the setUp() method, we see that we have a new line of code:
datastore_stub = apiproxy_stub_map.apiproxy._APIProxyStubMap__stub_map['datastore_v3']
datastore_stub.Clear()
Between test invocations, the datastore stub is NOT cleared. These two lines clear it for us, since the last thing we want is to have state persist between tests. That’s a very bad practice I occasionally see in “clever” attempts to save lines of code. Don’t do it. It causes flaky tests and will give you hours of pain. Reset your state and rebuild it each time.
test_save_comment() defines a handler and a set of rules for our Tipfy instance. We probably won’t be doing this for non-trivial applications, since the whole point is to test some handler code we wrote, but it serves our purpose for this example. We want to test for a side effect – in this case, that a comment was saved. In a more complete test, we would not only test for the number of comments, but we’d also test that the body was saved. Notice the difference in our call to client.post() – this invokes an HTTP POST instead of an HTTP GET.
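As a rough illustration of that “more complete test”, here is a self-contained sketch that checks the saved body as well as the count, and resets state in setUp. To keep it runnable without the App Engine SDK, it uses a plain in-memory list as a stand-in for the datastore; SAVED_COMMENTS and save_comment() are hypothetical names, not part of Tipfy or App Engine.

```python
import unittest

# Hypothetical in-memory stand-in for the datastore, used only so this
# sketch runs without the App Engine SDK.
SAVED_COMMENTS = []


def save_comment(body):
    """Stand-in for the handler's side effect of saving a Comment."""
    SAVED_COMMENTS.append({"body": body})
    return "OK"


class SaveCommentTest(unittest.TestCase):
    def setUp(self):
        # Reset shared state before every test, in the spirit of
        # datastore_stub.Clear()
        del SAVED_COMMENTS[:]

    def test_comment_count_and_body(self):
        self.assertEqual(save_comment("first!"), "OK")
        self.assertEqual(len(SAVED_COMMENTS), 1)
        # Check the saved content, not just the count
        self.assertEqual(SAVED_COMMENTS[0]["body"], "first!")

    def test_store_starts_empty(self):
        # Passes regardless of test order because setUp cleared the store
        self.assertEqual(SAVED_COMMENTS, [])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SaveCommentTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In the real test, the equivalent extra assertion would be on the body property of the entity returned by the Comment query.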
When we run nosetests with the command above, we get:
$ nosetests -d --with-gae --without-sandbox -v
test_hello_world_handler (apps.hello_world.tests.TestHandler) ... ok
test_pretty_hello_world_handler (apps.hello_world.tests.TestHandler) ... ok
test_save_comment (apps.hello_world.tests.TestHandler) ... ok
----------------------------------------------------------------------
Ran 3 tests in 0.206s
OK
And life is good again.
Final notes on testing
I’m not one of those people who believe that 100% test coverage, or even 80% test coverage, is needed for a project to be well covered. That much coverage often involves lots and lots of code, and the payoff is relatively minor, especially for trivial code paths.
I also see a lot of developers completely isolate each layer of the stack. In the datastore example above, these developers would have completely mocked out the datastore layer. I don’t find this to be a useful practice by default, as you end up testing your mocks and not the code. There are cases where this practice is useful, but in most cases, you will have more confidence in your code if you take the time to define a correct set of fixtures. Where you’ll 100% want mocks are places where you have complex or external services that can be flaky, or when you need to replicate failure conditions that are difficult to programmatically cause in your code.
Don’t think of testing as a replacement for QA because it’s not. In web testing, think of it as a replacement for opening a browser and clicking. When you discover a bug, you write a test for it and try to fix it, because in most cases setting up the error state will be much easier programmatically than manually. You’re always going to have to do browser testing at some point, but it’s time consuming, especially if you need your data in a specific state. You could go the Selenium route for full coverage, but in my experience (people are going to disagree with me on this – get ready for comment/Twitter trolling), Selenium tests, while providing a high level of confidence, also are extremely brittle and are a maintenance nightmare if you have too many of them. You’ll want to write as many tests as you can outside the browser environment and save Selenium for the minority of your user flows that are critical – write Javascript unit tests instead of Selenium tests for client side functionality. I’ve used JsUnit before and heard good things about Jasmine but never had experience with it myself.
And my last tip? Do what works for your team. But do write tests, because it’s one of those practices that will pay off over time if you write AND maintain them well.
- Ikai
App Engine datastore tip: monotonically increasing values are bad
When saving entities to App Engine’s datastore at a high write rate, avoid monotonically increasing values such as timestamps. Generally speaking, you don’t have to worry about this sort of thing until your application hits 100s of queries per second. Once you’re in that ballpark, you may want to examine potential hotspots in your application that can increase datastore latency.
To explain why this is, let’s examine what happens to the underlying Bigtable of an application with a high write rate. When a Bigtable tablet, a contiguous unit of storage, experiences a high write rate, the tablet will have to “split” into more than one tablet. This “split” allows new writes to shard. Here’s a visual approximation of what happens:
There’s a moment of pain – this is one of the causes of datastore timeouts in high write applications, as discussed in Nick Johnson’s article, “Handling Datastore Errors”.
Remember that for indexed values, we must write corresponding index rows. When values are randomly or even semi-randomly distributed, like, say, user email addresses, tablet splits function well. This is because the work to write multiple values is distributed amongst several Bigtable tablets:
The problems appear when we start saving monotonically increasing values like timestamps, or insert dictionary words in alphabetical order:
The new writes aren’t evenly distributed, and whichever tablet they end up going to becomes a new hot tablet in need of a split.
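A toy model can make the difference visible. The sketch below is only an approximation under assumptions I’m making up for illustration: tablets are modeled as fixed key ranges with hypothetical split points, whereas real Bigtable tablets split dynamically. It counts how many writes land on each tablet for random keys versus monotonically increasing ones.

```python
import bisect
import random
from collections import Counter


def tablet_for(key, split_points):
    """Return the index of the tablet (key range) that a key falls into."""
    return bisect.bisect_right(split_points, key)


def write_distribution(keys, split_points):
    """Count how many writes each tablet receives."""
    return Counter(tablet_for(k, split_points) for k in keys)


# Toy numbers: four tablets with hypothetical split points
split_points = [250, 500, 750]

rng = random.Random(42)
random_keys = [rng.randrange(1000) for _ in range(1000)]

# Stand-in for timestamps: every new key is larger than all existing keys
increasing_keys = list(range(1000, 2000))

print(write_distribution(random_keys, split_points))      # spread over all tablets
print(write_distribution(increasing_keys, split_points))  # everything hits the last tablet
```

With the increasing keys, every single write goes to the tablet holding the highest key range, which is exactly the hot tablet described above.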
As a developer, what can you do to avoid this situation?
- Avoid indexes unless you need to query against the values. No index = no hot tablet on increasing value
- Lower your write rate, or figure out how to better distribute values. A pure random distribution is best, but even a distribution that isn’t random will be better than a predictable, monotonically increasing value
- Prefix a shard identifier to your value. This is problematic if you plan on doing queries, as you will need to prefix and unprefix the values, then join the results in memory – but it will reduce the error rate of your writes
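Here is a minimal sketch of the shard prefix idea from the last tip. The shard count, the “NN|value” format, and all of the function names are assumptions for illustration; the datastore queries themselves are left out, and the per-shard results are represented as plain lists.

```python
import zlib

NUM_SHARDS = 8  # assumed shard count; tune for your write rate


def shard_prefix(entity_id, num_shards=NUM_SHARDS):
    """Pick a stable shard number from a non-monotonic attribute of the entity."""
    return zlib.crc32(entity_id.encode("utf-8")) % num_shards


def sharded_value(entity_id, timestamp, num_shards=NUM_SHARDS):
    """Build the indexed value: a zero-padded shard number, '|', then the timestamp."""
    return "%02d|%s" % (shard_prefix(entity_id, num_shards), timestamp)


def unprefix(value):
    """Strip the shard prefix to recover the original value."""
    return value.split("|", 1)[1]


def merge_shard_results(per_shard_results):
    """Join per-shard query results in memory, sorted by the original value."""
    merged = [unprefix(v) for results in per_shard_results for v in results]
    return sorted(merged)


value = sharded_value("user@example.com", "2011-01-15T10:00:00")
print(value)
print(merge_shard_results([["01|b", "01|d"], ["00|a", "00|c"]]))
```

Writes now spread across NUM_SHARDS separate key ranges instead of one. The cost shows up on the read side: a range query has to be issued once per shard and the results merged, as merge_shard_results() does.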
The tips are applicable whether you are on Master-Slave or High Replication datastore. And one more tip: don’t prematurely optimize for this case, since chances are, you won’t run into it. You can be spending that time working on features.
- Ikai
P.S. Yes, I drew those doodles. No, I do not have any formal art training (how could you tell?!)
GWT, Blobstore, the new high performance image serving API, and cute dogs on office chairs
I’ve been working on an image sharing application using GWT and App Engine to familiarize myself with the newer aspects of GWT. The project and code are here:
http://ikai-photoshare.appspot.com
http://github.com/ikai/gwt-gae-image-gallery
(Please excuse spaghetti code in client side GWT code, much of it was me feeling my way around GWT. I’ve come to appreciate GWT quite a bit in spite of the fact that I’m pretty familiar with client side development; I’ll write about this in a future post).
The 1.3.6 release of the App Engine SDK shipped with a high performance image serving API. What this means is that a developer can take a blob key pointing to image data stored in the blobstore and call getServingUrl() to create a special URL for serving the image. What are the benefits to using this API?
- You don’t have to write your own handler for uploaded images
- You don’t have to consume storage quota for saving resized or cropped images, as you can perform transforms on the image simply by appending URL parameters. You only need to store the final URL that is generated by getServingUrl().
- You aren’t charged for datastore CPU for fetching the image (you will still be billed for bandwidth)
- Images are, in general, served from edge server locations which can be geographically located closer to the user
There are a few drawbacks, however, to using the API:
- There aren’t any great schemes for access control of the images, and if someone has the URL for a thumbnail, they can easily remove the parameters to see a larger image
- Billing must be enabled – you will only be charged for usage, however, so you don’t have to spend a cent to use the API. You just have to have billing active.
- Deleting an image blob doesn’t delete the image being served from the URL right away – that image will still be available for some time
- Images must be uploaded to the blobstore, not the datastore as a blob, so it’s important to understand how the blobstore API works
- The URLs of the created images are really, really ugly. If you need pretty URLs, it’s probably a better pattern to create a URL mapping to an HTML page that just displays the image in an IMG tag
Blobstore crash course
It’ll be best if we gave a quick refresher course on the blobstore before we begin. Here’s the standard flow for a blobstore upload:
- Create a new blobstore session and generate an upload URL for a form to POST to. This is done using the createUploadUrl() method of BlobstoreService. Pass a callback URL to this method. This URL is where the user will be forwarded after the upload has completed.
- Present an upload form to the user. The action is the URL generated in step 1. Each URL must be unique: you cannot use the same URL for multiple sessions, as this will cause an error.
- After the user has uploaded the file, the user is forwarded to the callback URL in your App Engine application specified in step 1. The key of the uploaded blob, a String blob key, is passed as a URL parameter. Save this key and pass the user to their final destination
Got it? Now we can talk about image serving.
Using the image serving URL
Once we have a blob key (step 3 of a Blobstore upload), we can do interesting things with it. First, we’ll need to create an instance of the ImagesService:
ImagesService imagesService = ImagesServiceFactory.getImagesService();
Once we have an instance, we pass the blob key to getServingUrl and get back a URL:
String imageUrl = imagesService.getServingUrl(blobKey);
This can sometimes take several hundred milliseconds to a few seconds to generate, so it’s almost always a good idea to run this on write as opposed to first read. Subsequent calls should be faster, but they may not be as fast as reading this value from a datastore entity property or memcache. Since this value doesn’t change, it’s a good idea to store it. On the local dev server, this URL looks something like this:
/_ah/img/eq871HJL_bYxhWQbTeYYoA
In production, however, this will return a URL that looks like this:
http://lh5.ggpht.com/2PQk0vDo8Bn8oiPba2gtGlDfd1ciD0H0MLrixcT12FCDQEm2oyMW9ErJX_-ZzOHBWbYBKzevK0BY6cxdZ3cxf_37
(Cute dogs below)
You’ve already saved yourself the trouble of writing a handler. What’s really nice about this URL is that you can perform operations on it just by appending parameters. Let’s say we wanted to resize our image to be no larger than 144×144, yet retain its aspect ratio. We’d simply append “=s144” to the end of the URL:
http://lh5.ggpht.com/2PQk0vDo8Bn8oiPba2gtGlDfd1ciD0H0MLrixcT12FCDQEm2oyMW9ErJX_-ZzOHBWbYBKzevK0BY6cxdZ3cxf_37=s144
(Looks like this)
We can also crop the image by appending a “-c” to the size parameter:
http://lh5.ggpht.com/2PQk0vDo8Bn8oiPba2gtGlDfd1ciD0H0MLrixcT12FCDQEm2oyMW9ErJX_-ZzOHBWbYBKzevK0BY6cxdZ3cxf_37=s144-c
(Looks like this – compare with above)
Note that we can also generate these URLs programmatically using the overloaded version of getServingUrl that also accepts a size and crop parameter.
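If you only have the stored base URL on hand, the suffixes can also be built with plain string handling. This is a small sketch of that, based only on the “=sNN” and “-c” conventions shown above; sized_serving_url() is a hypothetical helper of my own, not part of the App Engine SDK, and the example URL is a placeholder.

```python
def sized_serving_url(serving_url, size=None, crop=False):
    """Append the resize/crop suffix described above to a stored serving URL."""
    if size is None:
        if crop:
            raise ValueError("crop requires a size")
        return serving_url
    suffix = "=s%d" % size
    if crop:
        suffix += "-c"
    return serving_url + suffix


base = "http://lh5.ggpht.com/example-image-id"  # placeholder URL
print(sized_serving_url(base, 144))             # resized to fit 144x144
print(sized_serving_url(base, 144, crop=True))  # resized and cropped
```

This is why storing only the base URL is enough: every thumbnail variant can be derived from it at render time.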
Adding GWT
So now that we’ve got all that done, let’s get it working with GWT. It’s important that we understand how it all works, because GWT’s single-page, Javascript-generated content model must be taken into account. Let’s draw our upload widget. We’ll be using UiBinder:
We’ll create our Composite class as follows:
public class UploadPhoto extends Composite {
private static UploadPhotoUiBinder uiBinder = GWT.create(UploadPhotoUiBinder.class);
UserImageServiceAsync userImageService = GWT.create(UserImageService.class);
interface UploadPhotoUiBinder extends UiBinder<Widget, UploadPhoto> {}
@UiField
Button uploadButton;
@UiField
FormPanel uploadForm;
@UiField
FileUpload uploadField;
public UploadPhoto(final LoginInfo loginInfo) {
initWidget(uiBinder.createAndBindUi(this));
}
}
Here’s the corresponding XML file:
<!DOCTYPE ui:UiBinder SYSTEM "http://dl.google.com/gwt/DTD/xhtml.ent">
<ui:UiBinder xmlns:ui="urn:ui:com.google.gwt.uibinder"
    xmlns:g="urn:import:com.google.gwt.user.client.ui">
  <g:FormPanel ui:field="uploadForm">
    <g:HorizontalPanel>
      <g:FileUpload ui:field="uploadField"></g:FileUpload>
      <g:Button ui:field="uploadButton"></g:Button>
    </g:HorizontalPanel>
  </g:FormPanel>
</ui:UiBinder>
(We’ll add more to this later)
When we discussed the Blobstore, we mentioned that each upload form has a different POST location corresponding to the upload session. We’ll have to add a GWT-RPC component to generate and return a URL. Let’s do that now:
// UserImageService.java
@RemoteServiceRelativePath("images")
public interface UserImageService extends RemoteService {
public String getBlobstoreUploadUrl();
}
Our IDE will nag us to generate the corresponding Async interface if we have a GWT plugin:
// UserImageServiceAsync.java
public interface UserImageServiceAsync {
public void getBlobstoreUploadUrl(AsyncCallback<String> callback);
}
We’ll need to write the code on the server side:
// UserImageServiceImpl.java
@SuppressWarnings("serial")
public class UserImageServiceImpl extends RemoteServiceServlet implements UserImageService {
@Override
public String getBlobstoreUploadUrl() {
BlobstoreService blobstoreService = BlobstoreServiceFactory.getBlobstoreService();
return blobstoreService.createUploadUrl("/upload");
}
}
This is pretty straightforward. We’ll want to invoke this service on the client side when we build the form. Let’s add this to UploadPhoto:
public class UploadPhoto extends Composite {
private static UploadPhotoUiBinder uiBinder = GWT.create(UploadPhotoUiBinder.class);
UserImageServiceAsync userImageService = GWT.create(UserImageService.class);
interface UploadPhotoUiBinder extends UiBinder<Widget, UploadPhoto> {}
@UiField
Button uploadButton;
@UiField
FormPanel uploadForm;
@UiField
FileUpload uploadField;
public UploadPhoto() {
initWidget(uiBinder.createAndBindUi(this));
// Disable the button until we get the URL to POST to
uploadButton.setText("Loading...");
uploadForm.setEncoding(FormPanel.ENCODING_MULTIPART);
uploadForm.setMethod(FormPanel.METHOD_POST);
uploadButton.setEnabled(false);
uploadField.setName("image");
// Now we use our GWT-RPC service and get a URL
startNewBlobstoreSession();
// Once we've hit submit and it's complete, let's set the form to a new session.
// We could also have probably done this on the onClick handler
uploadForm.addSubmitCompleteHandler(new FormPanel.SubmitCompleteHandler() {
@Override
public void onSubmitComplete(SubmitCompleteEvent event) {
uploadForm.reset();
startNewBlobstoreSession();
}
});
}
private void startNewBlobstoreSession() {
userImageService.getBlobstoreUploadUrl(new AsyncCallback<String>() {
@Override
public void onSuccess(String result) {
uploadForm.setAction(result);
uploadButton.setText("Upload");
uploadButton.setEnabled(true);
}
@Override
public void onFailure(Throwable caught) {
// We probably want to do something here
}
});
}
@UiHandler("uploadButton")
void onSubmit(ClickEvent e) {
uploadForm.submit();
}
}
This is fairly standard GWT RPC.
So that concludes the GWT part of it. We mentioned an upload callback. Let’s implement that now:
/**
* @author Ikai Lan
*
* This is the servlet that handles the callback after the blobstore
* upload has completed. After the blobstore handler completes, it POSTs
* to the callback URL, which must return a redirect. We redirect to the
* GET portion of this servlet which sends back a key. GWT needs this
* Key to make another request to get the image serving URL. This adds
* an extra request, but the reason we do this is so that GWT has a Key
* to work with to manage the Image object. Note the content-type. We
* *need* to set this to get this to work. On the GWT side, we'll take
* this and show the image that was uploaded.
*
*/
@SuppressWarnings("serial")
public class UploadServlet extends HttpServlet {
private static final Logger log = Logger.getLogger(UploadServlet.class
.getName());
private BlobstoreService blobstoreService = BlobstoreServiceFactory
.getBlobstoreService();
public void doPost(HttpServletRequest req, HttpServletResponse res)
throws ServletException, IOException {
Map<String, BlobKey> blobs = blobstoreService.getUploadedBlobs(req);
BlobKey blobKey = blobs.get("image");
if (blobKey == null) {
// Uh ... something went really wrong here
} else {
ImagesService imagesService = ImagesServiceFactory
.getImagesService();
// Get the image serving URL
String imageUrl = imagesService.getServingUrl(blobKey);
// For the sake of clarity, we'll use low-level entities
Entity uploadedImage = new Entity("UploadedImage");
uploadedImage.setProperty("blobKey", blobKey);
uploadedImage.setProperty(UploadedImage.CREATED_AT, new Date());
// Highly unlikely we'll ever filter on this property
uploadedImage.setUnindexedProperty(UploadedImage.SERVING_URL,
imageUrl);
DatastoreService datastore = DatastoreServiceFactory
.getDatastoreService();
datastore.put(uploadedImage);
res.sendRedirect("/upload?imageUrl=" + imageUrl);
}
}
@Override
protected void doGet(HttpServletRequest req, HttpServletResponse resp)
throws ServletException, IOException {
String imageUrl = req.getParameter("imageUrl");
resp.setHeader("Content-Type", "text/html");
// This is a bit hacky, but it'll work. We'll use this key in an Async
// service to
// fetch the image and image information
resp.getWriter().println(imageUrl);
}
}
We’ll probably want to display the image we just uploaded in the client. Let’s add a few lines of code to our SubmitCompleteHandler to do this:
public void onSubmitComplete(SubmitCompleteEvent event) {
uploadForm.reset();
startNewBlobstoreSession();
// This is what gets the result back - the content-type *must* be
// text/html
String imageUrl = event.getResults();
Image image = new Image();
image.setUrl(imageUrl);
final PopupPanel imagePopup = new PopupPanel(true);
imagePopup.setWidget(image);
// Add some effects
imagePopup.setAnimationEnabled(true); // animate opening the image
imagePopup.setGlassEnabled(true); // darken everything under the image
imagePopup.setAutoHideEnabled(true); // close image when the user clicks
// outside it
imagePopup.center(); // center the image
}
And we’re done!
Get the code
I’ve got the code for this project here:
http://github.com/ikai/gwt-gae-image-gallery
Just a warning, this is a bit different from the sample code above. I wrote this post after I wrote the code, extrapolating the bare minimum to make this work. The sample code above has experimental tagging, delete and catches logins. I’m adding features to it simply to see what else can be done, so expect changes. I’m aware of a few of the bugs with the code, and I’ll get around to fixing them, but again, it’s a demo project, so keep realistic expectations. As far as I can tell, however, the code above should be runnable locally and deployable (once you have enabled billing for blobstore).
Happy developing!
Using the App Engine Mapper for bulk data import
Since my last post describing App Engine mapreduce, a new InputReader has been added to the Java project for reading from the Blobstore. Nick Johnson wrote a great demo where indexing was done via reading code uploaded to the blobstore. This was demo’d at Google I/O. Now that the library is officially part of the project, it’s become much easier for developers to build Mappers that map across some large, contiguous piece of data as opposed to Entities in the datastore. The most obvious use case is data import. A developer looking to import large amounts of data would take the following steps:
- Create a CSV file containing the data you want to import. The assumption here is that each line of data corresponds to a datastore entity you want to create
- Upload the CSV file to the blobstore. You’ll need billing to be enabled for this to work.
- Create your Mapper, push it live and run your job importing your data.
This isn’t meant to be a replacement for the bulk uploader tool; merely an alternative. This method requires a good deal more code if you need custom data transforms. The advantage of this method is that the work is done on the server side, whereas the bulk uploader makes use of the remote API to get work done. Let’s get started on each of the steps.
Step 1: Create a CSV file with the data you want to upload
We’re going to go through an example of uploading City and State information. MaxMind.com provides a free GeoIP CSV file. The free version isn’t as full featured as the paid version, but it’ll do fine for our demo. Be sure that if you use this file in any kind of production application that you read and understand the license first! For simplicity, we’re going to parse out only cities in the United States using grep. The file should now contain lines that look like this:
605,"US","NY","Valhalla","10595",41.0877,-73.7768,501,914
606,"US","PA","Pittsburgh","15222",40.4495,-79.9880,508,412
607,"US","MO","Bridgeton","63044",38.7667,-90.4201,609,314
608,"US","CA","San Francisco","94124",37.7312,-122.3826,807,415
609,"US","NY","New York","10017",40.7528,-73.9725,501,212
610,"US","PA","Bear Lake","16402",41.9491,-79.4448,516,814
611,"US","NJ","Piscataway","08854",40.5516,-74.4637,501,732
612,"US","NY","Keuka Park","14478",42.5669,-77.1325,555,315
613,"US","VT","Brattleboro","05302",42.8496,-72.6645,506,802
Step 2: Create an upload handler for your CSV file and upload the CSV file
We’re going to create a basic handler for uploading a CSV file and displaying the key. We’ll need to pass this key to our mapper later. There isn’t too much magic here; it’s very similar to the sample code available for the basic blobstore example.
We’ll list the code we need here without walking through it in detail, since explaining it is out of scope for this post. We’ll need these files:
upload.jsp
<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<%@page import="com.google.appengine.api.blobstore.BlobstoreService"%>
<%@page import="com.google.appengine.api.blobstore.BlobstoreServiceFactory"%>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Upload your CSV file here</title>
</head>
<body>
<% BlobstoreService blobstoreService = BlobstoreServiceFactory.getBlobstoreService(); %>
<form action="<%= blobstoreService.createUploadUrl("/upload") %>" method="post" enctype="multipart/form-data">
<input type="file" name="data">
<input type="submit" value="Submit">
</form>
</body>
</html>
UploadBlobServlet.java
package com.ikai.mapperdemo.servlets;
import java.io.IOException;
import java.util.Map;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.blobstore.BlobKey;
import com.google.appengine.api.blobstore.BlobstoreService;
import com.google.appengine.api.blobstore.BlobstoreServiceFactory;
@SuppressWarnings("serial")
public class UploadBlobServlet extends HttpServlet {
public void doPost(HttpServletRequest req, HttpServletResponse resp)
throws IOException {
BlobstoreService blobstoreService = BlobstoreServiceFactory.getBlobstoreService();
Map<String, BlobKey> blobs = blobstoreService.getUploadedBlobs(req);
BlobKey blobKey = blobs.get("data");
if (blobKey == null) {
resp.sendRedirect("/");
} else {
resp.sendRedirect("/upload-success?blob-key=" + blobKey.getKeyString());
}
}
}
SuccessfulUploadServlet.java
package com.ikai.mapperdemo.servlets;
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
@SuppressWarnings("serial")
public class SuccessfulUploadServlet extends HttpServlet {
public void doGet(HttpServletRequest req, HttpServletResponse resp)
throws IOException {
String blobKey = req.getParameter("blob-key");
resp.setContentType("text/html");
resp.getWriter().println("Successfully uploaded. Download file: <br/>");
resp.getWriter().println(
"<a href='/serve?blob-key=" + blobKey
+ "'>Click to download</a>");
}
}
Source code for this and other helper functions should be available in the Github repository.
Step 3: Create your Mapper
Now we get to the fun part. We need to create our Mapper. A prerequisite for understanding what’s coming next is reading the last post about Mapper I wrote, so check that out before proceeding if you aren’t familiar with Mapper basics. Our Mapper class looks like this:
ImportFromBlobstoreMapper.java
package com.ikai.mapperdemo.mappers;
import java.util.logging.Logger;
import org.apache.hadoop.io.NullWritable;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
import com.google.appengine.tools.mapreduce.BlobstoreRecordKey;
import com.google.appengine.tools.mapreduce.DatastoreMutationPool;
/**
*
* This Mapper imports from a CSV file in the Blobstore. The CSV
* assumes it's in the MaxMind format for cities, states, zipcodes
* and lat/long.
*
*
* @author Ikai Lan
*
*/
public class ImportFromBlobstoreMapper extends
AppEngineMapper<BlobstoreRecordKey, byte[], NullWritable, NullWritable> {
private static final Logger log = Logger.getLogger(ImportFromBlobstoreMapper.class
.getName());
@Override
public void map(BlobstoreRecordKey key, byte[] segment, Context context) {
String line = new String(segment);
log.info("At offset: " + key.getOffset());
log.info("Got value: " + line);
// Line format looks like this:
// 10644,"US","VA","Tazewell","24651",37.0595,-81.5220,559,276
// We're also assuming no errant commas in this simple example
String[] values = line.split(",");
String state = values[2];
String cityName = values[3];
String zipcode = values[4];
Double latitude = Double.parseDouble(values[5]);
Double longitude = Double.parseDouble(values[6]);
state = state.replaceAll("\"", "");
cityName = cityName.replaceAll("\"", "");
zipcode = zipcode.replaceAll("\"", "");
if(!zipcode.isEmpty()) {
Entity zip = new Entity("Zip", zipcode);
zip.setProperty("state", state);
zip.setProperty("city", cityName);
zip.setProperty("latitude", latitude);
zip.setProperty("longitude", longitude);
Entity city = new Entity("City", cityName);
city.setProperty("state", state);
city.setUnindexedProperty("zip", zipcode);
DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
.getMutationPool();
mutationPool.put(zip);
mutationPool.put(city);
}
}
}
Let’s explain the things in this Mapper that are new:
public class ImportFromBlobstoreMapper extends AppEngineMapper<BlobstoreRecordKey, byte[], NullWritable, NullWritable>
Note this line. It’s different from our previous Mappers in that the type arguments are no longer Key and Entity, but BlobstoreRecordKey and byte[]. The source for BlobstoreRecordKey is here. Remember that map-reduce is about taking some large body of data and breaking it into smaller pieces to operate on. BlobstoreRecordKey represents a pointer to a range of data in our Blobstore. byte[] is a byte array actually containing that data.
public void map(BlobstoreRecordKey key, byte[] segment, Context context)
Again, notice the new types. By default, we are splitting on a newline, so segment represents a single line. We can change what we split on by specifying a terminator in mapreduce.xml.
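To make the "pointer to a range of data" idea concrete, here is a minimal plain-Java sketch of splitting a blob into terminator-delimited records. RecordSplitter and Record are hypothetical names made up for this illustration; they are not part of appengine-mapreduce, and the sketch makes no claim about how the real BlobstoreInputFormat treats record boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: splits a byte array into terminator-delimited records. */
final class RecordSplitter {

    /** One record: where it starts in the blob, and the bytes it contains. */
    static final class Record {
        final long offset;
        final byte[] segment;

        Record(long offset, byte[] segment) {
            this.offset = offset;
            this.segment = segment;
        }
    }

    static List<Record> split(byte[] data, byte terminator) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == terminator) {
                // The record covers [start, i); the terminator is not included
                records.add(new Record(start, Arrays.copyOfRange(data, start, i)));
                start = i + 1;
            }
        }
        if (start < data.length) {
            // Trailing record with no terminator after it
            records.add(new Record(start, Arrays.copyOfRange(data, start, data.length)));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] blob = "a,1\nb,2\nc,3".getBytes(StandardCharsets.UTF_8);
        for (Record r : split(blob, (byte) '\n')) {
            System.out.println(r.offset + ": " + new String(r.segment, StandardCharsets.UTF_8));
        }
    }
}
```

Each Record plays the role of one (key, segment) pair handed to map(): the offset says where the piece lives, and the byte array carries the piece itself.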
String line = new String(segment);
// Line format looks like this:
// 10644,"US","VA","Tazewell","24651",37.0595,-81.5220,559,276
// We're also assuming no errant commas in this simple example
String[] values = line.split(",");
String state = values[2];
String cityName = values[3];
String zipcode = values[4];
Double latitude = Double.parseDouble(values[5]);
Double longitude = Double.parseDouble(values[6]);
state = state.replaceAll("\"", "");
cityName = cityName.replaceAll("\"", "");
zipcode = zipcode.replaceAll("\"", "");
This is very naive String parsing. Nothing fancy here.
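If the "no errant commas" assumption ever stops holding, a quote-aware splitter is a small step up. The sketch below is plain Java with a hypothetical class name (QuoteAwareCsv); it is not what the demo project uses. It drops the surrounding quotes and does not preserve escaped quotes inside a field.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits one CSV line, ignoring commas inside quotes. */
final class QuoteAwareCsv {

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // Toggle quoted state and drop the quote character itself
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                // An unquoted comma ends the current field
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String line = "10644,\"US\",\"VA\",\"Tazewell\",\"24651\",37.0595,-81.5220,559,276";
        System.out.println(split(line));
    }
}
```

With this in place, the replaceAll calls that strip quotes would no longer be needed, since the splitter already removes them.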
if(!zipcode.isEmpty()) {
Entity zip = new Entity("Zip", zipcode);
zip.setProperty("state", state);
zip.setProperty("city", cityName);
zip.setProperty("latitude", latitude);
zip.setProperty("longitude", longitude);
Entity city = new Entity("City", cityName);
city.setProperty("state", state);
city.setUnindexedProperty("zip", zipcode);
DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
.getMutationPool();
mutationPool.put(zip);
mutationPool.put(city);
}
Again, very straightforward if you’ve seen this before. Some zipcodes in our CSV file subset are empty, so we’ll check for that and just not create an Entity. We’re adding 2 entities to the mutation pool here – a City and a Zip. This ensures that we can search by key when we do a datastore get. Remember that fetches by key are always faster than fetches with a query, since a query requires an index scan followed by a batch get, whereas the datastore can perform a get in a single operation.
That’s it for our Mapper. Let’s add a configuration:
<configuration name="Import all data from the Blobstore">
<property>
<name>mapreduce.map.class</name>
<!-- Set this to be your Mapper class -->
<value>com.ikai.mapperdemo.mappers.ImportFromBlobstoreMapper</value>
</property>
<!-- This is a default tool that lets us iterate over blobstore data -->
<property>
<name>mapreduce.inputformat.class</name>
<value>com.google.appengine.tools.mapreduce.BlobstoreInputFormat</value>
</property>
<property>
<name human="Blob Keys to Map Over">mapreduce.mapper.inputformat.blobstoreinputformat.blobkeys</name>
<value template="optional">blobkeyhere</value>
</property>
<property>
<name human="Number of shards to use">mapreduce.mapper.shardcount</name>
<value template="optional">10</value>
</property>
</configuration>
We’ve changed 2 properties here: the input format class as well as a property for the blobstore key pointing to the data to iterate over.
Step 4: Deploy!
We can now package our application up and deploy it! Make sure that you built a new JAR file with the new classes in appengine-mapreduce! If you have the old JAR file, it won’t include the BlobstoreInputFormat class that we need to do our work.
Step 5: Using the Mapper
Let’s browse to our upload handler at /upload.jsp. The page should be pretty bare.
Once the upload has finished, we’ll be on a page that looks like this:
Let’s copy the blob-key in the URL. It’s not the most streamlined approach but it works. We’ll use it in the next screen when we browse to our mapper:
We’ll copy-paste the key to replace “blobkeyhere” and hit “Run”. And now we play the waiting game – we’ll be able to check on the status of our Mapper in the UI, or check on Tasks, or look in the datastore to see if the data has been imported correctly:
Get the code
The code is here on Github:
http://github.com/ikai/App-Engine-Java-Mapper-API-demos
It’s been updated with the new examples.
Summary
So there you have it: another way of importing data into the datastore. This isn’t a replacement for the bulk uploader, just another option. Here are some useful links for additional information:
App Engine Mapreduce issues tracker – report issues here
Nick Johnson’s post explaining how he built the code search example
One last tip: the best place for questions or discussion is probably the App Engine Discussion Groups, not the comments.
Happy hacking.
Issuing App Engine datastore queries with the Low-Level API
Last time, I wrote an introduction to using the low-level API for creating entities, setting keys, and getting keys by value.
Basic queries and sorts
These are useful when we know the keys, but it’s often very useful to be able to query entities by their properties. Consider the Person entities we created for the last example, Alice and Bob:
Entity alice = new Entity("Person", "Alice");
alice.setProperty("gender", "female");
alice.setProperty("age", 20);
Entity bob = new Entity("Person", "Bob");
bob.setProperty("gender", "male");
bob.setProperty("age", 23);
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
datastore.put(alice);
datastore.put(bob);
Let’s create a query to find the first 10 Persons that are female and sort them by age ascending. How would we write this?
Query findFemalesQuery = new Query("Person");
findFemalesQuery.addFilter("gender", FilterOperator.EQUAL, "female");
findFemalesQuery.addSort("age", SortDirection.ASCENDING);
datastore.prepare(findFemalesQuery).asList(FetchOptions.Builder.withLimit(10));
Here are the steps we took:
- Created a Query object, specifying the Query kind
- Added a QueryFilter. Note that this is typesafe. We specify the enum representing the FilterOperator we want to use
- Added a QuerySort. Again, like the QueryFilter, we select the property to sort on as well as an enum representing either an ascending order or descending order.
- Prepared the query. From the prepared query we can get the results back as either an Iterator or a List of Entities, and we can either run it with the default options or pass a set of options. In the example above, we use FetchOptions.Builder to set the only option we care about: the limit. We only want 10, so we call withLimit() and pass it 10.
The query interface works well because it’s typesafe where the datastore is typesafe, and not so when the datastore is not – you won’t get errors at runtime because you misspelled “WHERE”, for instance, but you have to be careful not to misspell the properties you are looking for. The flexibility of this interface means that no longer are we constrained by the “every object must have the same bag of properties” frame of thinking. Furthermore, because we don’t need to know the property names a priori (we can use getProperties() and return a Map), we can iterate through this and figure out the key/value pairs at runtime. This leads to some very powerful abstractions.
Doing a keys only query
It sometimes makes sense for us to only retrieve the keys in a given query. It’s actually incredibly easy, as long as we know what to expect:
Query findFemalesQuery = new Query("Person");
findFemalesQuery.addFilter("gender", FilterOperator.EQUAL, "female");
findFemalesQuery.addSort("age", SortDirection.ASCENDING);
findFemalesQuery.setKeysOnly();
List<Entity> results = datastore.prepare(findFemalesQuery).asList(
FetchOptions.Builder.withLimit(10));
The only code that’s different in creating the Query object is that we call setKeysOnly(). This still returns a List of entity objects with only the Kind and Key populated. If we wrote a test for this, it would look like this:
Entity alice = results.get(0);
assertEquals("Return Key for Entity", KeyFactory.createKey("Person", "Alice"), alice.getKey());
assertNull("Should not return gender property", alice.getProperty("gender"));
assertEquals("Returns Entities with no properties", 0, alice.getProperties().size());
Only the Kind and Key are populated in these Entity objects. Even though the API looks similar, under the hood, the behavior is completely different. Recall how queries work underneath the hood:
- Traverse an index and retrieve keys
- Using those keys, fetch the entities from the datastore
The time to do a query depends on the index traversal time as well as the number of entities to retrieve. In a keys only query, this is what happens:
- Traverse an index and retrieve keys
We completely eliminate step 2 from the process. If all we want is Key information or are counting entities (and the count can be done using only indexes), this is the approach we would take.
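The difference between the two query styles can be modeled in a few lines of plain Java. This is only a toy model under loud assumptions: KeysOnlyDemo, its "age|key" index encoding and its fetch counter are all made up for illustration and say nothing about how the datastore is actually implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of "index scan, then batch get". Not the real datastore. */
final class KeysOnlyDemo {
    // Hypothetical index on age: sorted "age|key" entry -> entity key
    private final TreeMap<String, String> ageIndex = new TreeMap<>();
    // Hypothetical entity storage: key -> properties
    private final Map<String, Map<String, Object>> entities = new HashMap<>();
    // Counts how many full entities we had to fetch
    int entityFetches = 0;

    void put(String key, int age) {
        Map<String, Object> props = new HashMap<>();
        props.put("age", age);
        entities.put(key, props);
        ageIndex.put(String.format("%03d|%s", age, key), key);
    }

    /** Step 1 only: walk the index in order and collect keys. */
    List<String> queryKeysOnly(int limit) {
        List<String> keys = new ArrayList<>();
        for (String key : ageIndex.values()) {
            if (keys.size() == limit) {
                break;
            }
            keys.add(key);
        }
        return keys;
    }

    /** Step 1 plus step 2: walk the index, then fetch each entity. */
    List<Map<String, Object>> query(int limit) {
        List<Map<String, Object>> results = new ArrayList<>();
        for (String key : queryKeysOnly(limit)) {
            entityFetches++; // the extra work a keys only query avoids
            results.add(entities.get(key));
        }
        return results;
    }

    public static void main(String[] args) {
        KeysOnlyDemo db = new KeysOnlyDemo();
        db.put("Alice", 20);
        db.put("Eve", 25);
        System.out.println(db.queryKeysOnly(10) + " fetches=" + db.entityFetches);
        System.out.println(db.query(10) + " fetches=" + db.entityFetches);
    }
}
```

In this model the keys only path never touches entity storage, which is the saving described above.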
Ancestor Queries
Let’s pretend Alice and Bob have child entities:
Entity madHatter = new Entity("Friend", "Mad Hatter", alice.getKey());
Entity doormouse = new Entity("Friend", "Doormouse", alice.getKey());
Entity chesireCat = new Entity("Friend", "Chesire Cat", alice.getKey());
Entity redQueen = new Entity("Friend", "Red Queen", bob.getKey());
datastore.put(madHatter);
datastore.put(doormouse);
datastore.put(chesireCat);
datastore.put(redQueen);
Alice now has Friends Mad Hatter, Doormouse and the Chesire Cat as child entities, while Bob has only the Red Queen. How do we find all friends of Alice or Bob? Like so:
Query friendsOfAliceQuery = new Query("Friend");
friendsOfAliceQuery.setAncestor(alice.getKey());
List<Entity> results = datastore.prepare(friendsOfAliceQuery).asList(FetchOptions.Builder.withDefaults());
Query friendsOfBobQuery = new Query("Friend");
friendsOfBobQuery.setAncestor(bob.getKey());
results = datastore.prepare(friendsOfBobQuery).asList(FetchOptions.Builder.withDefaults());
What’s great about these queries is that the datastore knows exactly where to start. Because keys embed parent Key information – Mad Hatter, Doormouse and the Chesire Cat all have “Alice” as a prefix in their key (this is also why you cannot change an entity’s entity group after creation), we know that we just need to start the query from Alice’s Key and just traverse entities with a Key greater than Alice. It’s also a great way of organizing data. Just be aware that too many transactions on a single entity group will destroy your throughput, so design for as small entity groups as possible.
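The "keys embed the parent as a prefix" idea can be sketched with a sorted map in plain Java. This is an analogy only: the path-string encoding ("Person:Alice/Friend:Mad Hatter") and the AncestorScanDemo class are assumptions made for illustration, and real datastore keys are encoded differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of an ancestor query as a range scan over sorted key paths. */
final class AncestorScanDemo {
    // Keys are modeled as path strings such as "Person:Alice/Friend:Mad Hatter"
    private final TreeMap<String, String> rows = new TreeMap<>();

    void put(String keyPath, String value) {
        rows.put(keyPath, value);
    }

    /** Returns every key path that has ancestorPath as a parent prefix. */
    List<String> descendantsOf(String ancestorPath) {
        String from = ancestorPath + "/"; // first possible child path
        String to = ancestorPath + "0";   // '0' is the character right after '/'
        return new ArrayList<>(rows.subMap(from, to).keySet());
    }

    public static void main(String[] args) {
        AncestorScanDemo db = new AncestorScanDemo();
        db.put("Person:Alice", "alice");
        db.put("Person:Alice/Friend:Mad Hatter", "madHatter");
        db.put("Person:Bob", "bob");
        db.put("Person:Bob/Friend:Red Queen", "redQueen");
        System.out.println(db.descendantsOf("Person:Alice"));
    }
}
```

Because the map is sorted, the scan starts right at the ancestor's position and stops as soon as the prefix no longer matches; it never looks at Bob's rows.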
Summary
Hopefully this blog post explains a few more features of the low-level API. Understanding the low-level API is an important step in understanding the datastore, and understanding the datastore is a critical step for learning how to build efficient, optimized applications for App Engine.
Google App Engine Tips and Tricks: Prebuilding Indexes using a non default version
(This’ll be a shorter post than usual.)
Waiting for indexes to build can be a drag; indexes need to be built even before any Entities exist, and building can take longer than needed if the global index building workflow is backed up, since mass index building is a shared resource.
One little known trick is to pre-build indexes before your application needs them by deploying a non-default version. Your application can have many versions. In Java App Engine, this is defined in the version tag of appengine-web.xml. In Python, this is defined in the version YAML element. The Java Eclipse plugin even has a screen where the version can be set (click the App Engine icon, then “Project Settings”):
Because all applications share the same datastore, the required indexes will be built once you push your application containing the indexes configuration file with the new, required indexes. Hopefully, by the time you are ready to push your real version, the indexes will have completed building.
In general, it is a best practice to maintain a staging version of your application for testing against live data. App Engine makes this so easy it’s trivial: deploy code tagged with a new “version”. Your application is accessible at http://VERSION.latest.APPID.appspot.com (note that VERSION is a String, not an integer or decimal number) – this is a handy and powerful trick for validating a new test or staging version. When you have enough confidence in your application, browse to the Admin Console, click the radio button associated with your new version, and click “Make Default.”
Versioning has never been so easy. No configuring load balancers, rolling deploys, symlinking, restarting edge caches, etc.
Happy hacking.
Using the Java Mapper Framework for App Engine
Introduction to Map Reduce
If you aren’t familiar with Map Reduce, read more about it from a high level from Wikipedia here. The official paper can be downloaded from this site if you’re interested in a more technical discussion.
The simplest breakdown of MapReduce is as follows:
- Take a large dataset and break it into pieces, mapping individual pieces of data
- Work on those mapped datasets and reduce them into the form you need
A simple example here is full text indexing. Suppose we wanted to create indexes from existing text documents. We would use the Map step to iterate over every document and “map” each phrase or term to a document, then we would “reduce” the mappings by writing them to an index. Map/reduce problems have two advantages: they are easy to conceptualize as problems that can be distributed and parallelized, and there are frameworks that support many of the administrative functions of map-reduce: failure recovery, distribution of work, tracking status of jobs, reporting and so forth. The appengine-mapreduce project seeks to provide as many of these features as possible while making it as easy as possible for developers to write large batch processing jobs without having to think about the plumbing details.
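The full text indexing example can be sketched in plain Java, with no framework at all. InvertedIndexDemo and its method names are hypothetical; a real framework would run the map step in parallel across many workers, which this single-threaded sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Single-threaded sketch of map and reduce for a tiny inverted index. */
final class InvertedIndexDemo {

    /** Map step: emit one (term, docId) pair per word in the document. */
    static List<String[]> map(String docId, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                pairs.add(new String[] {term, docId});
            }
        }
        return pairs;
    }

    /** Reduce step: group the emitted pairs into term -> set of docIds. */
    static Map<String, Set<String>> reduce(List<String[]> pairs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String[] pair : pairs) {
            index.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(map("doc1", "The cat sat"));
        pairs.addAll(map("doc2", "The dog"));
        System.out.println(reduce(pairs));
    }
}
```

Each call to map() only needs its own document, which is what makes the step easy to distribute; reduce() is where the pieces are brought back together.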
But I only have Map available!
Yes, this is true - as of the writing of this post, only the “map” step exists, hence why it’s currently referred to as the “Mapper API”. That doesn’t mean it’s not useful. For starters, it is a very easy way to perform some operation on every single Entity of a given Kind in your datastore in parallel. What would you have to build for yourself if Mapper weren’t available?
- Begin querying over every Entity in chained Task Queues
- Store beginning and end cursors (introduced in 1.3.5)
- Create tasks to work with chunks of your datastore
- Write the code to manipulate your data
- Build an interface to control your batch jobs
- Build a callback system for your multitudes of parallelized workers to call when the entire task has completed
It’s certainly not a trivial amount of work. Some things you can do very easily with the Mapper library include:
- Modify some property or set of properties for every Entity of a given Kind
- Delete all entities of a single Kind – the functional equivalent of a “DROP TABLE” if you were using a relational database
- Count the occurrences of some property across every single Entity of a given Kind in your datastore
We’ll go through a few of these examples in this post.
Our Sample application
How to define a Mapper
There are four steps to defining a Mapper:
- Download, build and place the appengine-mapreduce JAR files in your WEB-INF/lib directory and add them to your build path. You only need to do this once per project. The steps for doing this are on the “Getting Started” page for Java. You’ll need all the JAR files that are built.
- Make sure that we have a DESCENDING index created on Key. This is important! If we run our Mapper locally, this’ll automatically be created in our datastore-indexes.xml file when we deploy our application. One trick to ensure that indexes get built before they are needed, at least in a live application, is to create and deploy an application with the new index configuration to a non-default version. Because all versions use the same datastore and the same set of indexes, this will schedule the index to be built before we need it in the live version. When it has completed, we simply switch the default version over, and we’re ready to roll.
- Create your Mapper class
- Configure your Mapper class in mapreduce.xml
We’ll go over steps 3 and 4 in each example.
Example 1: Changing a property on every Entity (Naive way)
(You can even use this technique if you just need to change a property on a large set of Entities).
Assuming you’ve already set up your environment for the Mapper servlet, you can dive right in. Let’s create a Mapper class that goes through every Entity of a given Kind and converts the “comment” property to use all lowercase letters. We’ll also add a timestamp for when we modified this Entity. In this first example, we’ll do this the naive way. This is a very good way to introduce you to very simple mutations on all your Entities using Mapper.
Note that this requires some familiarity with the Low-Level API. Don’t worry – entities edited or saved using the low-level API are accessible via a managed persistence interface such as JDO/JPA (and vice versa). If you aren’t familiar with the low-level API, you can read more about it here on the Javadocs.
The first thing we’ll have to do is define a Mapper. We tried as much as possible to mimic Hadoop’s Mapper class. We’ll be subclassing AppEngineMapper, which is itself a subclass of Hadoop’s Mapper. The meat of this class is the map() method, which we’ll be overriding. We’ll also override the taskSetup() lifecycle callback. We’ll be using this to initialize our DatastoreService, though we could probably initialize it in the body of the map() method itself. The other methods are taskCleanup(), setup() and cleanup() – examples here. Let’s have a look at our code below:
package com.ikai.mapperdemo.mappers;
import java.util.Date;
import java.util.logging.Logger;
import org.apache.hadoop.io.NullWritable;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
/**
*
* This mapper changes all Strings to lowercase Strings, sets
* a timestamp, and reputs them into the Datastore. The reason
* this is a "Naive" Mapper is because it doesn't make use of
* Mutation Pools, which can do these operations in batch instead
* of individually.
*
* @author Ikai Lan
*
*/
public class NaiveToLowercaseMapper extends
AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
private static final Logger log = Logger
.getLogger(NaiveToLowercaseMapper.class.getName());
private DatastoreService datastore;
@Override
public void taskSetup(Context context) {
this.datastore = DatastoreServiceFactory.getDatastoreService();
}
@Override
public void map(Key key, Entity value, Context context) {
log.info("Mapping key: " + key);
if (value.hasProperty("comment")) {
String comment = (String) value.getProperty("comment");
comment = comment.toLowerCase();
value.setProperty("comment", comment);
value.setProperty("updatedAt", new Date());
datastore.put(value);
}
}
}
Notice that this map method takes 3 parameters:
Key key – this is the datastore Key for the Entity we are about to perform an operation on. Mostly this exists for API compatibility with Hadoop. When iterating over datastore Entities we don’t really need it: we *could* use it to look up the Entity, but we don’t have to because …
Entity value – … because we actually get the Entity already. If we did a lookup for the Entity, we’d double the amount of lookups we do per Entity. We can certainly use the Key to do a lookup using a PersistenceManager or EntityManager and have a populated, typesafe Entity object, but from an efficiency standpoint we’d be doubling our work for some JDO/JPA sugar.
Context context – We don’t need this in our example, but it’s easy to think of the Context as giving us access to “global” values such as temporary variables and configuration files. For a later example in this post, we’ll be using the Context to store a global value in a counter and increment it. For this example, it’s unused.
If you’re familiar at all with the low-level API, this will look very straightforward (again, I highly encourage you to read the docs). We take an entity, add 2 properties to it, then re-put() the Entity back into the datastore.
Now let’s add this job to mapreduce.xml:
<configurations>
<configuration name="Naive Mass toLowercase()">
<property>
<name>mapreduce.map.class</name>
<!-- Set this to be your Mapper class -->
<value>com.ikai.mapperdemo.mappers.NaiveToLowercaseMapper</value>
</property>
<!-- This is a default tool that lets us iterate over datastore entities -->
<property>
<name>mapreduce.inputformat.class</name>
<value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
</property>
<property>
<name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
<value template="optional">Comment</value>
</property>
</configuration>
</configurations>
It looks complex, but it’s really not. We define a configuration element and name the job. The name of the job is also the name we’ll see in the GUI when we fire off the job. We need 3 sets of property elements under this element, which are just name/value pairs. Let’s go over each one we used:
Name: mapreduce.map.class
Value: com.ikai.mapperdemo.mappers.NaiveToLowercaseMapper
This one is straightforward – we provide the name of an AppEngineMapper subclass with the map() method we want run.
Name: mapreduce.inputformat.class
Value: com.google.appengine.tools.mapreduce.DatastoreInputFormat
This is a class that takes some input to map over. DatastoreInputFormat is provided by appengine-mapreduce, but it is possible for us to define our own input formatter. For guidance, check out the source of DatastoreInputFormat here.
In a more advanced example (ahem, future blog post), we’ll discuss building our own InputFormat to read from another source such as the Blobstore. For our examples in this post, we won’t need anything beyond DatastoreInputFormat.
Name: mapreduce.mapper.inputformat.datastoreinputformat.entitykind
Value: Comment
This input is specific to DatastoreInputFormat. It tells DatastoreInputFormat which Entity Kind to iterate over. Note that in the mapper console, a user can type in the name of a Kind or edit this Field to reflect the value they want. We can’t leave this blank, though, if we want this to work.
If we browse to the URI at which we’ve defined the Mapper console (in our case /mapper), we see something that looks like this:
“Running jobs” appears when we click “Run”. We can click “Detail” to see the progress of our job, or we can “Abort” to quit the job. Note that aborting a job won’t revert our Entities! We’ll end up with a partially run job if we run a giant mutation, so we’ll have to be cognizant of this when we use this tool.
When the job completes, we’ll take a look at our Comments. Sure enough, they are now all lowercase.
Example 2: Changing a property on every Entity using Mutation Pools
There’s a reason the Mapper in Example 1 is called a Naive Mapper: because it doesn’t take advantage of mutation pools. As we all know, App Engine’s datastore is capable of handling operations in parallel using batched calls. We’re already doing work in parallel by specifying shards, but we’ll want to use batched calls when possible. We do this by adding the mutations we want to a mutation pool, then, periodically as the pool hits a certain size, we flush all the writes to the datastore with a single call instead of individually. This has the advantage of making our map() call as fast as possible, since all we’re really doing is making a list of operations to perform all at once when the system is good and ready. Let’s define the XML file first assuming we call the class PooledToLowercaseMapper:
<configuration name="Mass toLowercase() with Mutation Pool">
<property>
<name>mapreduce.map.class</name>
<!-- Set this to be your Mapper class -->
<value>com.ikai.mapperdemo.mappers.PooledToLowercaseMapper</value>
</property>
<!-- This is a default tool that lets us iterate over datastore entities -->
<property>
<name>mapreduce.inputformat.class</name>
<value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
</property>
<property>
<name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
<value template="optional">Comment</value>
</property>
</configuration>
It looks almost exactly the same. That’s because the meat is in what we do in the actual class itself:
package com.ikai.mapperdemo.mappers;
import java.util.Date;
import java.util.logging.Logger;
import org.apache.hadoop.io.NullWritable;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
import com.google.appengine.tools.mapreduce.DatastoreMutationPool;
/**
*
* The functionality of this is exactly the same as in {@link NaiveToLowercaseMapper}.
* The advantage here is that since a {@link DatastoreMutationPool} is used, mutations
* can be done in batch, saving API calls.
*
* @author Ikai Lan
*
*/
public class PooledToLowercaseMapper extends
AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
private static final Logger log = Logger
.getLogger(PooledToLowercaseMapper.class.getName());
@Override
public void map(Key key, Entity value, Context context) {
log.info("Mapping key: " + key);
if (value.hasProperty("comment")) {
String comment = (String) value.getProperty("comment");
comment = comment.toLowerCase();
value.setProperty("comment", comment);
value.setProperty("updatedAt", new Date());
DatastoreMutationPool mutationPool = this.getAppEngineContext(
context).getMutationPool();
mutationPool.put(value);
}
}
}
Everything looks exactly the same until these lines:
DatastoreMutationPool mutationPool = this.getAppEngineContext(context).getMutationPool();
mutationPool.put(value);
Aha! So we finally put the context to use. Granted, we use the context as a parameter to another, more useful method, but at least we’re using it. We acquire a DatastoreMutationPool using the getAppEngineContext(context).getMutationPool() method, then we just call put() and pass the changed entity. DatastoreMutationPool is defined here and is open source.
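To make the batching idea concrete, here is a minimal plain-Java sketch of a pool that collects writes and flushes them in groups. BatchingPool is a hypothetical class written for this illustration; it is not DatastoreMutationPool, and the flush threshold and counters are assumptions, not the library's behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy write pool: collects items and "writes" them in batches. */
final class BatchingPool<T> {
    private final int flushSize;
    private final List<T> pending = new ArrayList<>();
    private int flushCount = 0;
    private int itemsWritten = 0;

    BatchingPool(int flushSize) {
        this.flushSize = flushSize;
    }

    /** Queues one item; flushes automatically once the pool is full. */
    void put(T item) {
        pending.add(item);
        if (pending.size() >= flushSize) {
            flush();
        }
    }

    /** Writes everything queued so far as a single batch. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // A real pool would issue one batched datastore call here
        flushCount++;
        itemsWritten += pending.size();
        pending.clear();
    }

    int getFlushCount() {
        return flushCount;
    }

    int getItemsWritten() {
        return itemsWritten;
    }

    public static void main(String[] args) {
        BatchingPool<String> pool = new BatchingPool<>(3);
        for (int i = 0; i < 7; i++) {
            pool.put("entity-" + i);
        }
        pool.flush(); // write whatever is left over
        System.out.println(pool.getFlushCount() + " batches, " + pool.getItemsWritten() + " items");
    }
}
```

The point of the design is that put() is cheap: it only appends to a list, and the expensive call happens once per batch rather than once per entity.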
The interface is similar to that of DatastoreService. There’s not a lot of fancy stuff going on here. put(), as we’ve seen, is defined. get() isn’t, because, well, that method makes no sense in this context. delete() is defined, which brings me to my bonus section:
Bonus Example 2: Delete all Entities of a given Kind
One of the most common questions asked in the group is, “How do I drop table?” Usually, this question is asked by new App Engine developers who don’t yet understand that the datastore is a distributed key-value store and not a relational database. But it’s also a legitimate use case. What if you just wanted to nuke all Entities of a given Kind? Prior to Mapper, you would have had to write your own handler to take care of this. Mapper makes this very easy. Here’s what a generic “DeleteAllMapper” would look like. This will work with *any* Entity Kind:
package com.ikai.mapperdemo.mappers;
import java.util.logging.Logger;
import org.apache.hadoop.io.NullWritable;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
import com.google.appengine.tools.mapreduce.DatastoreMutationPool;
/**
*
* This Mapper deletes all Entities of a given kind. It simulates the
* DROP TABLE functionality asked for by developers.
*
* @author Ikai Lan
*
*/
public class DeleteAllMapper extends
AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
private static final Logger log = Logger.getLogger(DeleteAllMapper.class
.getName());
@Override
public void map(Key key, Entity value, Context context) {
log.info("Adding key to deletion pool: " + key);
DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
.getMutationPool();
mutationPool.delete(value.getKey());
}
}
That’s it! We wire it up the same way we wire up other Mappers:
<configuration name="Delete all Entities">
<property>
<name>mapreduce.map.class</name>
<!-- Set this to be your Mapper class -->
<value>com.ikai.mapperdemo.mappers.DeleteAllMapper</value>
</property>
<!-- This is a default tool that lets us iterate over datastore entities -->
<property>
<name>mapreduce.inputformat.class</name>
<value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
</property>
<property>
<name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
<value template="optional">Comment</value>
</property>
</configuration>
I’ve separated each out into its own mapreduce.xml, but this isn’t necessary. A given App Engine project can have multiple configuration elements defined. That’s why there’s a dropdown list in the Mapreduce console GUI.
Example 3: Taking more user input in the Mapper console and counting
Our next example covers using counters in the context. Let’s say that we wanted to allow the User to enter a String, then we iterate over every Entity searching for occurrences of that Substring on-the-fly and not with pre-built indexes. First, let’s discuss the XML configuration we use:
<configuration name="Count words in all Comments">
<property>
<name>mapreduce.map.class</name>
<!-- Set this to be your Mapper class -->
<value>com.ikai.mapperdemo.mappers.CountWordsMapper</value>
</property>
<property>
<!-- This is the URL to call after the entire Mapper has run -->
<name>mapreduce.appengine.donecallback.url</name>
<value>/callbacks/word_count_completed</value>
</property>
<!-- This is a default tool that lets us iterate over datastore entities -->
<property>
<name>mapreduce.inputformat.class</name>
<value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
</property>
<property>
<name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
<value template="optional">Comment</value>
</property>
</configuration>
There’s one new name/value pair:
Name: mapreduce.mapper.counter.substringtarget
Value: Substring
We can pick any name or value we want. We just pick this one because it makes sense. We’ll retrieve this value in the Mapper via the Context. This causes an extra text field to appear in the Mapper console:
The Mapper is below:
package com.ikai.mapperdemo.mappers;
import java.util.logging.Logger;
import org.apache.hadoop.io.NullWritable;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
/**
*
* This Mapper takes some input and counts the number of Comments which
* contain that substring.
*
* @author Ikai Lan
*
*/
public class SubstringMatcherMapper extends
AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
private static final Logger log = Logger.getLogger(SubstringMatcherMapper.class
.getName());
/*
* Get the target that we want to match on and count the number of Comments that
* match it
*/
@Override
public void map(Key key, Entity value, Context context) {
String substringToMatch = context.getConfiguration().get("mapreduce.mapper.counter.substringtarget");
String comment = (String) value.getProperty("comment");
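// Skip entities without a comment, and skip matching if no target was configured
if (comment != null && substringToMatch != null) {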
if(comment.contains(substringToMatch)) {
log.info("Found match in: " + comment);
context.getCounter("SubstringMatch", "count").increment(1);
}
}
}
}
We retrieve the value entered by the user with this line of code:
context.getConfiguration().get("mapreduce.mapper.counter.substringtarget");
If the comment we’re currently working on contains the substring, we want to increment our count. The context object has a getCounter() method that returns a counter we can increment or decrement:
context.getCounter("SubstringMatch", "count").increment(1);
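To make the counter semantics concrete, here’s a minimal, self-contained sketch that doesn’t touch the framework at all. CounterSketch is a hypothetical stand-in invented for illustration, not an appengine-mapreduce or Hadoop class. It only models the idea that a counter is a running total identified by a group and a name, and that each matching call to map() adds to that total:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the framework's counters (illustration only). */
final class CounterSketch {

    // One running total per "group:name" pair.
    private final Map<String, AtomicLong> totals = new ConcurrentHashMap<>();

    /** Adds amount to the counter for (group, name), creating it at zero if needed. */
    void increment(String group, String name, long amount) {
        totals.computeIfAbsent(group + ":" + name, k -> new AtomicLong()).addAndGet(amount);
    }

    /** Returns the current total, or 0 if the counter was never incremented. */
    long value(String group, String name) {
        AtomicLong total = totals.get(group + ":" + name);
        return total == null ? 0L : total.get();
    }

    /** Mirrors the body of map(): bump the counter once per matching comment. */
    static long countMatches(Iterable<String> comments, String target) {
        CounterSketch counters = new CounterSketch();
        for (String comment : comments) {
            if (comment != null && target != null && comment.contains(target)) {
                counters.increment("SubstringMatch", "count", 1);
            }
        }
        return counters.value("SubstringMatch", "count");
    }

    public static void main(String[] args) {
        List<String> comments = Arrays.asList("hello world", "goodbye", null, "say hello");
        System.out.println(countMatches(comments, "hello")); // prints 2
    }
}
```

In the real Mapper we never see this bookkeeping; we just call getCounter() on the context and let the framework keep the total for us.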
When our job finishes running, we can see the total count by clicking “Detail” on the completed job.
More likely than not, however, we’ll want to store this number back in the datastore or do something with it besides sticking it on a status page. Good thing we mentioned that …
Example 4: Completion callbacks and JobContexts
Let’s modify Example 3 a bit. Suppose now we want to count the total number of words across all comments. We’ll need a counter again, but instead of just displaying the total on a console page, we want to store it back in the datastore. Like Task Queues, incoming email, and XMPP, the completion callback is event driven: it is dispatched as an HTTP request to a URI in our app. So we’ll define a servlet with a doPost() handler and read the input out of the request parameters.
The first thing we’ll need to do is configure our Mapper to fire off the callback when done. We do this in mapreduce.xml:
<configuration name="Count words in all Comments">
<property>
<name>mapreduce.map.class</name>
<!-- Set this to be your Mapper class -->
<value>com.ikai.mapperdemo.mappers.CountWordsMapper</value>
</property>
<property>
<!-- This is the URL to call after the entire Mapper has run -->
<name>mapreduce.appengine.donecallback.url</name>
<value>/callbacks/word_count_completed</value>
</property>
<!-- This is a default tool that lets us iterate over datastore entities -->
<property>
<name>mapreduce.inputformat.class</name>
<value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
</property>
<property>
<name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
<value template="optional">Comment</value>
</property>
</configuration>
Here’s the property we care about:
Name: mapreduce.appengine.donecallback.url
Value: /callbacks/word_count_completed
The value can be any URI in your application; just be sure that URI is mapped to the Servlet that handles your callback. Let’s define the Mapper class:
package com.ikai.mapperdemo.mappers;
import java.util.logging.Logger;
import org.apache.hadoop.io.NullWritable;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
/**
*
* This mapper counts the number of total words across all comments. It cheats a
* bit by just splitting on whitespace and just using the length. This mapper
* demonstrates use of counters as well as using a completion callback.
*
* @author Ikai Lan
*
*/
public class CountWordsMapper extends
AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
private static final Logger log = Logger.getLogger(CountWordsMapper.class
.getName());
/*
* This is a bit of a lazy implementation more to prove a point than to
* actually be correct. Split on whitespace, count words
*/
@Override
public void map(Key key, Entity value, Context context) {
String comment = (String) value.getProperty("comment");
if (comment != null) {
String[] words = comment.split("\\s+");
int wordCount = words.length;
// Takes a "group" and a "counter"
// We'll use these later to store the final count back in the
// datastore
context.getCounter("CommentWords", "count").increment(wordCount);
}
}
}
Not a lot that’s new here. We use the context again to store a counter. Note that we can increment by any value, not just 1.
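Since the Mapper admits to cheating on the word count, it’s worth seeing where splitting on whitespace goes wrong. The sketch below is plain Java with no framework dependencies; WordCountSketch, naiveCount and carefulCount are names made up for this illustration. It compares the tutorial’s approach with a slightly more careful one that trims first and treats blank input as zero words:

```java
/** Illustration only: compares the tutorial's word count with a trimmed variant. */
final class WordCountSketch {

    /** The tutorial's approach: split on runs of whitespace and take the array length. */
    static int naiveCount(String comment) {
        return comment.split("\\s+").length;
    }

    /** Trims first and treats null or blank input as zero words. */
    static int carefulCount(String comment) {
        if (comment == null) {
            return 0;
        }
        String trimmed = comment.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String[] samples = { "two words", "  leading spaces", "" };
        for (String s : samples) {
            System.out.println("[" + s + "] naive=" + naiveCount(s)
                    + " careful=" + carefulCount(s));
        }
    }
}
```

With the naive version, a comment that starts with whitespace picks up an extra empty token, and an empty comment counts as one word. Neither matters much for a demo, but both would skew a total summed over every comment in the datastore.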
Let’s take a look at the servlet that handles this callback:
package com.ikai.mapperdemo.servlets;
import java.io.IOException;
import java.util.Date;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.JobID;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.tools.mapreduce.MapReduceState;
import com.ikai.mapperdemo.mappers.CountWordsMapper;
/**
* This is the servlet that takes care of any processing we have to do after we
* have finished running {@link CountWordsMapper}.
*
* This is just a standard servlet - we can do anything we want here. We can use
* any App Engine API such as email or XMPP, for instance, to notify an
* administrator. We could also store a final state into the datastore - in
* fact, that is what this example below does.
*
* @author Ikai Lan
*
*/
@SuppressWarnings("serial")
public class WordCountCompletedCallbackServlet extends HttpServlet {
public void doPost(HttpServletRequest req, HttpServletResponse resp)
throws IOException {
String jobIdName = req.getParameter("job_id");
JobID jobId = JobID.forName(jobIdName);
// A future iteration of this will likely contain a default
// option if we don't care which DatastoreService instance we use.
DatastoreService datastore = DatastoreServiceFactory
.getDatastoreService();
try {
// We get the state back from the job_id parameter. The state is
// serialized and stored in the datastore, so we pass an instance
// of the datastore service.
MapReduceState mrState = MapReduceState.getMapReduceStateFromJobID(
datastore, jobId);
// There's a bit of ceremony to get the actual counter. This
// example is intentionally verbose for clarity. First we get all
// the Counters, then the CounterGroup, then the Counter, and
// finally the count.
Counters counters = mrState.getCounters();
CounterGroup counterGroup = counters.getGroup("CommentWords");
Counter counter = counterGroup.findCounter("count");
long wordCount = counter.getValue(); // Finally!
// Let's create a special datastore Entity for this value so
// we can reference it on the ViewComments page
Entity totalCountEntity = new Entity("TotalWordCount",
"total_word_count");
totalCountEntity.setProperty("count", wordCount);
// Now we timestamp this bad boy
totalCountEntity.setProperty("updatedAt", new Date());
datastore.put(totalCountEntity);
} catch (EntityNotFoundException e) {
// Keep the original exception as the cause so it shows up in the logs
throw new IOException("No datastore state", e);
}
}
}
The JobID comes as a String parameter. We get it like so:
String jobIdName = req.getParameter("job_id");
JobID jobId = JobID.forName(jobIdName);
Watch the imports here. There are two classes named JobID, one deprecated and one not, and your IDE may import the wrong one. The servlet above imports org.apache.hadoop.mapreduce.JobID.
Once you have the JobID, you use it to retrieve the MapReduceState:
MapReduceState mrState = MapReduceState.getMapReduceStateFromJobID(datastore, jobId);
From the MapReduceState object, we have to perform a bit of a ceremony to get what we want. We need to:
1. Fetch the Counters from the MapReduceState
2. Fetch the appropriate CounterGroup from the Counters object
3. Fetch the named Counter from the CounterGroup
4. Fetch the value from the Counter
Here’s what it looks like in code:
Counters counters = mrState.getCounters();
CounterGroup counterGroup = counters.getGroup("CommentWords");
Counter counter = counterGroup.findCounter("count");
long wordCount = counter.getValue();
We can now do what we want with this count. In our servlet example, we save it to a datastore Entity and use it later on.
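One thing that’s easy to trip over: the group and counter names used in the servlet must match the strings the Mapper passed to getCounter(). The sketch below is a simplified model, assuming counters are just nested maps; the real Counters, CounterGroup and Counter classes come from Hadoop and are not used here. CounterLookupSketch and its lookup() helper are hypothetical names. It shows the same group-then-counter walk wrapped in one helper, with the names defined once as constants:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: models the Counters -> CounterGroup -> Counter walk with nested maps. */
final class CounterLookupSketch {

    // Define the group and counter names once so the mapper and the
    // callback servlet cannot drift apart.
    static final String GROUP = "CommentWords";
    static final String NAME = "count";

    /**
     * Walks the same path as the servlet: all counters, then the group,
     * then the named counter, then its value. Returns fallback when the
     * group or counter is missing from this simplified model.
     */
    static long lookup(Map<String, Map<String, Long>> counters,
                       String group, String name, long fallback) {
        Map<String, Long> counterGroup = counters.get(group);
        if (counterGroup == null) {
            return fallback;
        }
        Long value = counterGroup.get(name);
        return value == null ? fallback : value;
    }

    public static void main(String[] args) {
        Map<String, Long> group = new HashMap<>();
        group.put(NAME, 1234L);
        Map<String, Map<String, Long>> counters = new HashMap<>();
        counters.put(GROUP, group);

        System.out.println(lookup(counters, GROUP, NAME, -1L));         // prints 1234
        System.out.println(lookup(counters, "CommentWord", NAME, -1L)); // prints -1 (misspelled group)
    }
}
```

Sharing constants like these between CountWordsMapper and the callback servlet is one way to avoid a typo that silently reads the wrong counter.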
Get the code
You’re undoubtedly ready to start playing with this thing. You’ve got everything you need to know. First, here’s the getting started page for appengine-mapreduce in Java:
Here’s my sample source code on GitHub.
Summary
So there you have it: an easy-to-use tool for mapping operations across entire Entity Kinds. There are still a lot of topics to cover, and we’ll likely explore them in a future article. For instance, I didn’t have a chance to cover building your own InputFormat class. We’re still hard at work extending this framework (adding the “Shuffle” and “Reduce” phases, for example), so please post your feedback in the App Engine groups or file bugs in the issue tracker.



