Mapping Python Code Over Records With lwpb

Tuesday, February 08, 2011.

In which we reimplement wc(1) as a python one-liner, discover a neat feature in the Python interpreter, rip through a bunch of records in a document database, and generally start to wonder if we're converging on the awk(1) of Protocol Buffers.

In my work on lwpb, a library which includes a fast Python encoder and decoder for Google Protocol Buffers, one of the first things I needed was a comfortable way to convert between a protobuf stream and a plain old text stream. You know, the usual Unix tab- and newline-delimited records thing. Once you've got this conversion, your old friends grep(1), cut(1), sort(1), et al. can help you again.

The conversion tool I came up with is called pbio. It converts in both directions, and can also do some other things like extract a range of records. So far, pretty pedestrian right?

But with a mere 8 line patch, pbio has suddenly become immensely more useful: you can now map Python code supplied at the command line over your records, producing new calculated fields. This is a big step towards MapReduce-style programming, but without the overhead of having to write a separate program each time which defines a distinct map function. As a programmer, I'm always looking for ways to write less code and still get the job done.

In the pbio case, there is zero overhead code required to calculate and output a new field, a surprising and mostly accidental consequence of lwpb's decision to encode and decode using dictionaries, together with a serendipitous feature of Python's exec() built-in. More on that in a second.

To frame all of this with an example, suppose you have the following simple schema.proto file for a document database.

package org;

message Document {
  required uint64 docid = 1;
  required string url = 2;
  required string content = 3;
};

Perhaps this database is populated by a web crawler. You'd like to know the length, in bytes, of each document.

Sure, you could dump the entire content of each document with pbio and pipe that to a script that calculates lengths, but that's a bit wasteful. And you're also going to have to grok the percent-escaped sequences that pbio uses.

Here's a better way:

pbio.py -F 'url,length' -e 'length=len(content)' -p schema.pb -m org.Document < docs.pb

Let's break down what's going on here:

  • pbio is inputting protobuf records, and outputting text records (the default mode)
  • -F 'url,length' tells pbio to output text records with these two fields
  • -p schema.pb specifies a compiled version of schema.proto you've created with protoc
  • -m org.Document says records in docs.pb conform to the org.Document message type
  • -e 'length=len(content)' calculates a new field in the output record named length

The -F, -p, and -m options are just about input and output, while the -e option is more interesting. How does pbio know to populate a new field in the output record named length with the calculated value? Does it parse the code, looking for assignments that match up with fields specified by -F? And how has a field of the input record, content, become available as a local variable?

This is where Python's exec() comes in:

exec_stmt ::= "exec" or_expr ["in" expression ["," expression]]

In all cases, if the optional parts are omitted, the code is executed in the current scope. If only the first expression after in is specified, it should be a dictionary, which will be used for both the global and the local variables... As a side effect, an implementation may insert additional keys into the dictionaries given besides those corresponding to variable names set by the executed code.

The input record, which has been decoded by lwpb into a dictionary, becomes the scope in which the user-supplied code executes. Each input field becomes a local variable in the new scope -- in our example, this means the user-supplied code automatically sees local variables docid, url, and content, one for each field in the input record. After execution, any new local variables created by assignments become new fields in the output record.

More complicated code is possible. For instance, here's wc(1):

pbio.py -F 'lines,words,chars,url' -p schema.pb -m org.Document < docs.pb -e '
chars=len(content)
words=len(content.split())
lines=len(content.split("\n"))'

To wrap up the original example, let's find the top 10 longest documents:

pbio.py -F 'url,length' -e 'length=len(content)' -p schema.pb -m org.Document < docs.pb | sort -k2 -nr | head -10
Posted by Alan on Tuesday, February 08, 2011. (Discuss)

blog comments powered by Disqus
maelstrom

"After a little while I became possessed with the keenest curiosity about the whirl itself. I positively felt a wish to explore its depths, even at the sacrifice I was going to make; and my principal grief was that I should never be able to tell my old companions on shore about the mysteries I should see."

Illustration for Edgar Allan Poe's story "Descent into the Maelstrom" by Harry Clarke, published in 1919.