So, after watching a few cool videos from Google I/O 2011 & 2012 and reading a bit of documentation, I wanted to do the following in Python:
- Use App Engine's MapReduce and Pipelines APIs to run a MapReduce job that transforms (i.e., ETL) some data in parallel into a CSV format compatible with Google BigQuery
- Store the transformed results of my MapReduce job in Google Cloud Storage
- Ingest (i.e., load) the transformed results into a new table inside a BigQuery dataset
- Utilize BigQuery to run blazingly fast queries across my data.
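To make the transform step above concrete, here's a minimal sketch of the kind of map function the ETL stage would run. The field names (`user_id`, `event`, `timestamp`) are hypothetical; a real job would emit each CSV line through the MapReduce library's Cloud Storage output writer rather than returning it directly:

```python
import csv
import io


def transform_record(record):
    """Map one raw input record (a dict) to a single CSV line.

    The resulting lines, written to a file in Google Cloud Storage,
    form a CSV file that a BigQuery load job can ingest into a table.
    Field names here are made-up examples, not from the sample code.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([
        record.get("user_id", ""),
        record.get("event", ""),
        record.get("timestamp", ""),
    ])
    # Strip the trailing newline; the output writer adds its own.
    return buf.getvalue().rstrip("\r\n")


line = transform_record(
    {"user_id": "42", "event": "click", "timestamp": "2012-06-01T00:00:00"}
)
print(line)
```

Using the `csv` module instead of joining strings by hand means values containing commas or quotes get escaped correctly, which matters because BigQuery's CSV parser expects standard quoting.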
The only issue was finding a working "hello world" sort of code sample that would teach me how to do all of that. After a bit of searching, I found what I was looking for and was quickly able to do what I wanted. Here are the links in case anyone is interested:
- Sample Python code from Google’s open source code repository:
- A tutorial that walks you through that sample code step by step at a high level
- Caveat: there are one or two typos in the tutorial's version of the code, so when in doubt, rely on the code in the repository