MapReduce With MongoDB 1.8 and Java


In my last post, I introduced the new MapReduce features in MongoDB 1.8, which is now available as a release candidate. Most importantly, the temporary collection system has gone away; you are now required to specify an output parameter. With that required output come new options for creating incremental output using the merge and reduce output modes.

As I write this, we are prepping new releases of our Java Driver (v2.5) and our Scala driver, Casbah (v2.1), which are intended to support MongoDB 1.8’s new features, including incremental MapReduce. Since I implemented the APIs for the new MapReduce output in both drivers, I thought I’d demonstrate how these new output features apply to the previous dataset. This post focuses on the Java API; a Scala one will likely follow.

As a reminder (or a primer for those who skipped my last post), I’ve been testing the 1.8 MapReduce using a dataset and MapReduce job originally created to test the MongoDB+Hadoop Plugin. It consists of daily U.S. Treasury yield data for about 20 years; the MapReduce task calculates an annual average for each year in the collection. You can grab a copy of the entire collection in a handy mongoimport-friendly datadump from the MongoDB+Hadoop repo; here’s a quick snippet of it:

{ "_id" : ISODate("1990-01-10T00:00:00Z"), "dayOfWeek" : "WEDNESDAY", "bc3Year" : 7.95, "bc5Year" : 7.92, "bc10Year" : 8.03, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.91, "bc3Month" : 7.75, "bc30Year" : 8.11, "bc1Year" : 7.77, "bc7Year" : 8, "bc6Month" : 7.78 }
{ "_id" : ISODate("1990-01-11T00:00:00Z"), "dayOfWeek" : "THURSDAY", "bc3Year" : 7.95, "bc5Year" : 7.94, "bc10Year" : 8.04, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.91, "bc3Month" : 7.8, "bc30Year" : 8.11, "bc1Year" : 7.77, "bc7Year" : 8.01, "bc6Month" : 7.8 }
{ "_id" : ISODate("1990-01-12T00:00:00Z"), "dayOfWeek" : "FRIDAY", "bc3Year" : 7.98, "bc5Year" : 7.99, "bc10Year" : 8.1, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.93, "bc3Month" : 7.74, "bc30Year" : 8.17, "bc1Year" : 7.76, "bc7Year" : 8.07, "bc6Month" : 7.8100000000000005 }
{ "_id" : ISODate("1990-01-16T00:00:00Z"), "dayOfWeek" : "TUESDAY", "bc3Year" : 8.13, "bc5Year" : 8.11, "bc10Year" : 8.2, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 8.1, "bc3Month" : 7.89, "bc30Year" : 8.25, "bc1Year" : 7.92, "bc7Year" : 8.18, "bc6Month" : 7.99 }
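Before getting into the driver APIs, the job itself is easy to sketch: group the documents by year and average one of the yield fields. Here’s a minimal pure-Python simulation of the map/reduce logic (using only the bc10Year field, with plain dicts standing in for BSON documents — this is an illustration of the computation, not the actual driver call):

```python
from datetime import datetime
from collections import defaultdict

def map_doc(doc):
    """Emit (year, value) pairs, as the JavaScript map function would."""
    yield doc["_id"].year, doc["bc10Year"]

def reduce_values(key, values):
    """Average the values emitted for one key.

    Note: a real reduce function should emit count/sum so it can safely
    be re-run over its own output (see the 'reduce' output mode below).
    """
    return {"count": len(values), "avg": sum(values) / len(values)}

docs = [
    {"_id": datetime(1990, 1, 10), "bc10Year": 8.03},
    {"_id": datetime(1990, 1, 11), "bc10Year": 8.04},
    {"_id": datetime(1990, 1, 12), "bc10Year": 8.10},
    {"_id": datetime(1990, 1, 16), "bc10Year": 8.20},
]

# the "shuffle" step: collect all emitted values per key
grouped = defaultdict(list)
for doc in docs:
    for key, value in map_doc(doc):
        grouped[key].append(value)

results = {key: reduce_values(key, values) for key, values in grouped.items()}
```

Run over the four sample documents above, this yields a single 1990 bucket averaging the four bc10Year values.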

A Look at MongoDB 1.8’s MapReduce Changes


MongoDB 1.7.5 shipped yesterday and is expected to be the last ‘beta’ release of what will become MongoDB 1.8. As part of the release, I’ve been testing the new MapReduce functionality and thought this a good time to highlight those changes for people.

If you aren’t new to MongoDB MapReduce, the most important change since MongoDB 1.6.x is that temporary collections are gone; you are now required to specify an output. Previously, if you omitted the out argument, MongoDB would create a temporary collection and return its name with the job results; in non-sharded MongoDB setups these temporary collections would go out of scope and be cleaned up when the connection closed. Unfortunately, in sharded setups it wasn’t possible to clean these up safely; they would remain behind and clutter up the database. For this and other reasons the temporary collection feature was removed. There is good news, though: it has been replaced with an even better system for saving the results of MapReduce jobs!

While the out argument is now a required parameter in MapReduce jobs, it has a number of options for controlling what MongoDB does with the results. If you’re running a truly one-off job where you don’t need to keep the results later, MongoDB now supports returning results “inline”. Be careful here, though: your results are returned in a single document and are subject to the document size limitations of MongoDB (16MB per document in 1.8). To use inline results, set the value of out to the document {inline: 1}. The returned object will contain an additional key, results, which holds the MapReduce output; the result field (which would otherwise name the output collection) is omitted.

As with previous versions of MongoDB, you can specify a collection name (as a string) in the out argument. If the named collection already exists, MongoDB will replace it entirely with the MapReduce results. Along with the inline mode, MongoDB 1.8 introduces support for “merge” and “reduce” output modes; instead of replacing the target collection, MongoDB can be instructed to reconcile the MapReduce results with the existing data. To use these modes, set the value of out to a document with a key of either “merge” or “reduce”, whose value names the collection to save to.
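Concretely, the four shapes the out argument can take look like this (illustrated with plain Python dicts standing in for the command’s BSON; the collection name here is made up):

```python
# the four output modes for MapReduce in MongoDB 1.8
out_inline  = {"inline": 1}                 # return results in the command reply
out_replace = "yearly_averages"             # bare string: replace the collection
out_merge   = {"merge": "yearly_averages"}  # overwrite docs whose keys collide
out_reduce  = {"reduce": "yearly_averages"} # re-reduce docs whose keys collide
```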

The difference between “merge” and “reduce” is what MongoDB does when it encounters the same key in both the existing collection and the MapReduce results. In “merge” mode, MongoDB will simply overwrite the existing document with the new one from the MapReduce output. In “reduce” mode, MongoDB will run the reduce function again with both the new and old data, saving those results to the collection (you remembered to make your reduce function idempotent, right?). UPDATE: If you specified a “finalize” function, MongoDB will re-run it after this second “reduce” pass.
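The distinction is easier to see simulated. Here’s a rough sketch of what the server does in each mode, with dicts standing in for collections and a trivial summing reduce function (not MongoDB’s actual implementation, just the reconciliation logic):

```python
def reduce_fn(key, values):
    # idempotent: reducing a single value returns it unchanged
    return sum(values)

def merge_output(existing, new_results):
    """'merge' mode: new results simply overwrite colliding keys."""
    out = dict(existing)
    out.update(new_results)
    return out

def reduce_output(existing, new_results, reduce_fn):
    """'reduce' mode: colliding keys are re-reduced with the old value."""
    out = dict(existing)
    for key, value in new_results.items():
        if key in out:
            out[key] = reduce_fn(key, [out[key], value])
        else:
            out[key] = value
    return out

existing = {"a": 3, "b": 7}       # already in the output collection
new_results = {"a": 5, "c": 1}    # fresh MapReduce results
```

With this data, merge keeps 5 for "a" (the new value wins), while reduce combines old and new into 8; keys present on only one side pass through unchanged in both modes.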

Now that I’ve thoroughly confused you, let’s dig into examples of each of these behaviors. I’ve been testing the 1.8 MapReduce using a dataset and MapReduce job originally created to test the MongoDB+Hadoop Plugin. It consists of daily U.S. Treasury yield data for about 20 years; the MapReduce task calculates an annual average for each year in the collection. You can grab a copy of the entire collection in a handy mongoimport-friendly datadump from the MongoDB+Hadoop repo; here’s a quick snippet of it:

{ "_id" : ISODate("1990-01-10T00:00:00Z"), "dayOfWeek" : "WEDNESDAY", "bc3Year" : 7.95, "bc5Year" : 7.92, "bc10Year" : 8.03, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.91, "bc3Month" : 7.75, "bc30Year" : 8.11, "bc1Year" : 7.77, "bc7Year" : 8, "bc6Month" : 7.78 }
{ "_id" : ISODate("1990-01-11T00:00:00Z"), "dayOfWeek" : "THURSDAY", "bc3Year" : 7.95, "bc5Year" : 7.94, "bc10Year" : 8.04, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.91, "bc3Month" : 7.8, "bc30Year" : 8.11, "bc1Year" : 7.77, "bc7Year" : 8.01, "bc6Month" : 7.8 }
{ "_id" : ISODate("1990-01-12T00:00:00Z"), "dayOfWeek" : "FRIDAY", "bc3Year" : 7.98, "bc5Year" : 7.99, "bc10Year" : 8.1, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.93, "bc3Month" : 7.74, "bc30Year" : 8.17, "bc1Year" : 7.76, "bc7Year" : 8.07, "bc6Month" : 7.8100000000000005 }
{ "_id" : ISODate("1990-01-16T00:00:00Z"), "dayOfWeek" : "TUESDAY", "bc3Year" : 8.13, "bc5Year" : 8.11, "bc10Year" : 8.2, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 8.1, "bc3Month" : 7.89, "bc30Year" : 8.25, "bc1Year" : 7.92, "bc7Year" : 8.18, "bc6Month" : 7.99 }

Exploring Scala With MongoDB


2010 proved to be a great year for growth and adoption of many fledgling technologies—not least among them, MongoDB and Scala. Scala is designed as an alternative language for the Java platform, with a focus on scalability. It merges many of the object-oriented concepts of languages like Java and C++ with the functional tools of Erlang, Haskell and Lisp, and a bit of the dynamic nature of modern languages like Ruby and Python. This flexibility has sped Scala’s adoption in the technology stacks of platforms like LinkedIn, Twitter, Foursquare and many more. By running on the JVM, Scala has a strong affinity for working alongside existing Java applications, which allows users to build on their existing technology investments.

For 2011, MongoDB has added official support for Scala with the release of Casbah, a Scala driver for MongoDB. Casbah is built around the existing MongoDB Java Driver to give it a strong foundation, but is designed to take advantage of many Scala idioms, such as a strong collections library, fluid syntax for building DSLs, and functional concepts like closures and currying.

Because it is designed to be easy for Scala users to work with, Casbah introduces a friendlier syntax for creating MongoDB objects, using Scala’s Map syntax:

import com.mongodb.casbah.Imports._

/** Create an object directly */
val newObj = MongoDBObject("foo" -> "bar",
                           "x" -> "y",
                           "pie" -> 3.14,
                           "spam" -> "eggs")

/** Or, use a builder interface */
val builder = MongoDBObject.newBuilder
builder += "foo" -> "bar"
builder += "x" -> "y"
builder += ("pie" -> 3.14)
builder += ("spam" -> "eggs", "mmm" -> "bacon")
val builtObj = builder.result

The goal of this syntax is to be more readable, similar to what one might expect from a dynamic language like Ruby or Python.
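For comparison, drivers for dynamic languages get this ergonomics for free. In PyMongo, for example, a MongoDB document is just an ordinary dict, and incremental construction is ordinary dict manipulation — which is roughly the feel Casbah’s syntax is reaching for:

```python
# the PyMongo equivalent of the Casbah example above: a plain dict
new_obj = {"foo": "bar", "x": "y", "pie": 3.14, "spam": "eggs"}

# building incrementally is just ordinary dict manipulation
built = {}
built["foo"] = "bar"
built["x"] = "y"
built.update({"pie": 3.14, "spam": "eggs", "mmm": "bacon"})
```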

Talking to ActiveDirectory From IronPython


We’re building a new intranet system at work, and I’ve been toying with a few things that the Windows admin asked for. Namely: since the secretaries here will be updating the intranet data to add people’s work and emergency contact numbers, AIM handles, email addresses, etc., we needed a way to keep it all in sync with ActiveDirectory, thereby keeping all the Outlooks and Blackberries up to date with the latest contact information.

This seemed like a fairly reasonable request, presuming we could figure out how to do it, and since I’ve been using Mono and IronPython a lot more lately, I figured there would be a way to accomplish it. Most of the information I found online was either really old and/or poorly documented C#, or, more commonly, used PowerShell or VBScript. So I poked around and sorted out how to get IronPython on Mono (IronPython 2.6RC + Mono 2.4.2.3) to find and update our users.

The end result is that I can now, from IronPython, find and update valid information on ActiveDirectory entries to reflect the latest and greatest information. One thing to note: the MS .NET ActiveDirectory APIs (System.DirectoryServices, which is mirrored in Mono) do something that confused and annoyed me. There is a limited set of ‘valid’ attribute keys for a user object in Active Directory (which is really just LDAP, in case you didn’t know). The DirectoryEntry object has a Properties attribute, which contains a hashmap of these values.
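Here’s a rough sketch of the pattern I ended up with. The .NET imports are deferred into the function so the module still loads outside IronPython/Mono; the LDAP path, account name, and phone number in the usage line are placeholders, and telephoneNumber is just one of the ‘valid’ attribute keys mentioned above:

```python
def build_filter(account_name):
    """LDAP filter to find a user by their sAMAccountName."""
    return "(sAMAccountName=%s)" % account_name

def update_phone(ldap_path, account_name, phone):
    # deferred .NET imports: only available under IronPython on .NET/Mono
    import clr
    clr.AddReference("System.DirectoryServices")
    from System.DirectoryServices import DirectoryEntry, DirectorySearcher

    root = DirectoryEntry(ldap_path)
    searcher = DirectorySearcher(root)
    searcher.Filter = build_filter(account_name)
    result = searcher.FindOne()
    if result is None:
        return False

    entry = result.GetDirectoryEntry()
    # only the 'valid' Active Directory attribute keys are accepted here
    entry.Properties["telephoneNumber"].Value = phone
    entry.CommitChanges()
    return True
```

Usage would look something like update_phone("LDAP://dc.example.com/DC=example,DC=com", "jdoe", "555-0100"), with your own domain controller and base DN substituted in.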

Double Clicking in IronPython + Silverlight


Silverlight rather oddly lacks a double-click event. You can detect a single click, but for double-clicking you’re on your own. I found some examples for C# but none for IronPython, so a few months ago I ported some code I found here to IronPython.

I’ve been using it in application development and user testing for about six months, and it works fairly well in both Silverlight 2 and 3, although YMMV.

from System import DateTime

CLICK_INTERVAL = 500  # double-click window, in milliseconds
 
class ClickHandler(object):
    """Attach a Double Click handler to any given UIElement,
    as by default Silverlight lacks support for double clicking.
    """
    _lastClickTime = 0
    _callback = None
 
    @staticmethod
    def attachDoubleClickHandler(target, callback):
        return ClickHandler(target, callback)
 
    def __init__(self, target, callback):
        target.MouseLeftButtonUp += self.target_MouseLeftButtonUp
        target.MouseLeftButtonDown += self.target_MouseLeftButtonDown
        self._callback = callback
 
    def target_MouseLeftButtonUp(self, sender, args):
        self._lastClickTime = DateTime.Now.Ticks / 10000
 
    def target_MouseLeftButtonDown(self, sender, args):
        now = DateTime.Now.Ticks / 10000
        if now - self._lastClickTime < CLICK_INTERVAL and \
          now - self._lastClickTime > 25:
            self._callback(sender, args)

You may want to tweak the click interval based on what works for you; 500 milliseconds worked well with my users for the application in question. To attach a ClickHandler, simply invoke the static method ClickHandler.attachDoubleClickHandler(«uiElementObj», «callbackFunc»).