Disco
Anyway, back to Disco.
A couple of months ago I was researching the means
to process (OCR'ing, shape detection in diagrams, cross referencing, etc)
a fairly large number of documents (60K - 100K).
Processing that many documents on a single machine took a number of weeks.
How could we speed that up?
One of the possible solutions I came across was Disco.
Disco is a MapReduce implementation written
in Erlang (core)
and Python (tools and the MapReduce jobs).
Working for a C# development shop at the time Disco was not immediately applicable.
It did, however, remain in the back of my head to return to one day.
The 5 min lightning talks seemed like a good excuse to play with it a bit more.
Installation of Disco was fairly simple.
However I did have to patch one file (lib/disco/comm.py)
to get chunking in its ddfs tool working.
When I later realized a fix for this issue had been available in Disco's repository for over six months
as a pull request
I prematurely drew the conclusion that Disco was not actively developed.
Today,
while writing this blog post,
I took another look at Disco's project page at Github
and noticed the steady stream of commits.
Hence it is, contrary to what I said at the PUN meeting, definitely actively developed.
So, what makes Disco so interesting
that I wanted to bring it to the attention of other Python developers? An example shows that best:
from disco.core import Job, result_iterator
def map(line, params):
for word in line.split():
yield word, 1
def reduce(iter, params):
from disco.util import kvgroup
for word, counts in kvgroup(sorted(iter)):
yield word, sum(counts)
if __name__ == '__main__':
job = Job().run(
input=["http://discoproject.org/media/text/chekhov.txt"],
map=map,
reduce=reduce)
for word, count in result_iterator(job.wait(show=True)):
print word, count
That's plain and simple Python code!
Furthermore it's a complete Disco MapReduce job.
As you can see
it only takes two functions, map and reduce,
without a lot of boiler plate code to implement a MapReduce job.
Now compare that to writing a simple Apache Hadoop client
Something slightly more complicated,
an inner_join operation on arbitrarily large datasets
still looks simple.
I think it is a testimony to good design
if problems can be expressed easily and succinctly in a framework.
You might wonder whether Disco actually scales.
After all, MapReduce problems crave to be distributed over as many nodes as you can dedicate to it.
Well, Nokia Research Center in Palo Alto runs Disco on an 800 node cluster.
That should give you an idea of its scalability.
According to the What is Disco page,
Disco's main features are:
- Proven to scale to hundreds of CPUs and tens of thousands of simultaneous tasks.
- Used to process datasets in the scale of tens of terabytes.
- Extremely simple to use: A typical tasks consists of two functions written in Python and two calls to the Disco API.
- Tasks can be specified in any other language as well, by implementing the Disco worker protocol.
- Input data can be in any format, even binary data such as images. The data can be located on any source that is accessible by HTTP or it can distributed to local disks.
- Fault-tolerant: Server crashes don’t interrupt jobs. New servers can be added to the system on the fly.
- Flexible: In addition to the core map and reduce functions, a combiner function, a partition function and an input reader can be provided by the user.
- Easy to integrate to larger applications using the standard Disco module and the Web APIs.
- Comes with a built-in distributed storage system (Disco Distributed Filesystem).
5 minutes to intrigue you, I hope it worked.