Last night I attended the Python Users Netherlands (PUN) Meeting hosted by Nelen & Schuurmans in Utrecht. This must have been the third or fourth time I attended a PUN Meeting. Each time I'm struck by the small size of the Python community in The Netherlands. I do realize not every Python developer in The Netherlands attends every or even any PUN meeting. But still, it is an excellent way to keep in touch with fellow Python developers and get a feel for what Python developers in The Netherlands are working on.
If you are a Python developer looking for work, there is even more reason to attend. This time two companies formally announced they were looking for Python developers to join their development teams. Even more people in the audience raised their hands when asked whether their company was looking for Python developers as well. I heard similar sounds when I attended PyGrunn earlier this year. Even in these challenging times there seems to be plenty of opportunity for good Python developers.
The 30 min Talks
The two half-hour talks on the agenda both mentioned Pyramid. I have played with Pyramid once before and came away very impressed, so I was hoping to see more details on how people were using Pyramid in production. The two talks were a little light on specific Pyramid details, but did give a very good overview of the products being built on top of it.
Wichert Akkerman's talk on How 2Style4You uses Pyramid (and more Python) intriguingly showed the complexities of developing i18n and l10n applications, something most Dutch developers normally don't have to take into account.
Wichert was also kind enough to lend me his laptop for my 5 min lightning talk, as mine had broken down earlier that day.
Marcel van den Elst seemed to have a lot of fun developing with MongoDB. His talk on MongoEngine + Relational + Privileges (on Pyramid!) showed the open source tools his company had built on top of MongoDB to implement their progressive planning tool.
The 5 min Lightning Talks
I started off the lightning talks with my 5 min presentation on Disco. Let me tell you that 5 minutes to present something is really, really short. I should have prepared myself a little better because I ran out of time as soon as I had started.
Anyway, back to Disco. A couple of months ago I was researching the means to process (OCR'ing, shape detection in diagrams, cross referencing, etc) a fairly large number of documents (60K - 100K). Processing that many documents on a single machine took a number of weeks. How could we speed that up?
One of the possible solutions I came across was Disco. Disco is a MapReduce implementation written in Erlang (core) and Python (tools and the MapReduce jobs). As I was working for a C# development shop at the time, Disco was not immediately applicable. It did, however, remain in the back of my head as something to return to one day. The 5 min lightning talks seemed like a good excuse to play with it a bit more.
Installation of Disco was fairly simple. However, I did have to patch one file (lib/disco/comm.py) to get chunking in its ddfs tool working. When I later realized a fix for this issue had been available in Disco's repository for over six months as a pull request, I prematurely drew the conclusion that Disco was not actively developed.
Today, while writing this blog post, I took another look at Disco's project page on GitHub and noticed the steady stream of commits. Hence it is, contrary to what I said at the PUN meeting, definitely actively developed.
So, what makes Disco so interesting that I wanted to bring it to the attention of other Python developers? An example shows that best:
    from disco.core import Job, result_iterator

    def map(line, params):
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                        map=map,
                        reduce=reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print word, count
That's plain and simple Python code! Furthermore, it's a complete Disco MapReduce job. As you can see, it only takes two functions, map and reduce, and very little boilerplate code to implement a MapReduce job. Now compare that to writing a simple Apache Hadoop client.
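If you want a feel for what those two functions compute without setting up a Disco cluster, the sketch below mimics the map, sort and reduce phases in a single process. It is only an illustration: run_local, map_words and reduce_counts are names I made up, nothing here is part of Disco, and itertools.groupby stands in for disco.util.kvgroup. It is written for Python 3.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line, params):
    # Emit a (word, 1) pair for every word on the line.
    for word in line.split():
        yield word, 1


def reduce_counts(pairs, params):
    # Sort so equal words are adjacent, then sum the counts per word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines, map_fn, reduce_fn, params=None):
    # Hypothetical single-process stand-in for a job run: map, then reduce.
    mapped = (pair for line in lines for pair in map_fn(line, params))
    return list(reduce_fn(mapped, params))


if __name__ == "__main__":
    text = ["the cat sat", "the cat ran"]
    for word, count in run_local(text, map_words, reduce_counts):
        print(word, count)
```

The point of the exercise is that the map and reduce functions stay exactly as small as in the Disco job; only the plumbing around them differs.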
Something slightly more complicated, such as an inner_join operation on arbitrarily large datasets, still looks simple. I think it is a testament to good design if problems can be expressed easily and succinctly in a framework.
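To give an idea of how a join fits the map/reduce mould, here is a rough sketch of a reduce-side inner join. This is not Disco's own inner_join example: the record layout (source, key, value) and the names map_tagged, reduce_join and inner_join_local are assumptions of mine, and everything is held in memory, so it only illustrates the idea and would not cope with arbitrarily large datasets.

```python
from collections import defaultdict


def map_tagged(record, params):
    # Assumed record layout: (source, key, value), source is "left" or "right".
    source, key, value = record
    yield key, (source, value)


def reduce_join(pairs, params):
    # Collect the values of both sides per key, then pair them up.
    left = defaultdict(list)
    right = defaultdict(list)
    for key, (source, value) in pairs:
        (left if source == "left" else right)[key].append(value)
    for key in sorted(left):
        for left_value in left[key]:
            # Keys missing on the right side yield nothing: an inner join.
            for right_value in right.get(key, []):
                yield key, (left_value, right_value)


def inner_join_local(records):
    # Hypothetical single-process driver: map every record, then reduce.
    mapped = [pair for record in records for pair in map_tagged(record, None)]
    return list(reduce_join(mapped, None))
```

Feeding it two "left" records with keys 1 and 2 and two "right" records with keys 1 and 3 yields a single joined row for key 1.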
You might wonder whether Disco actually scales. After all, MapReduce problems crave to be distributed over as many nodes as you can dedicate to them. Well, Nokia Research Center in Palo Alto runs Disco on an 800 node cluster. That should give you an idea of its scalability.
- Proven to scale to hundreds of CPUs and tens of thousands of simultaneous tasks.
- Used to process datasets in the scale of tens of terabytes.
- Extremely simple to use: A typical task consists of two functions written in Python and two calls to the Disco API.
- Tasks can be specified in any other language as well, by implementing the Disco worker protocol.
- Input data can be in any format, even binary data such as images. The data can be located on any source that is accessible by HTTP or it can be distributed to local disks.
- Fault-tolerant: Server crashes don’t interrupt jobs. New servers can be added to the system on the fly.
- Flexible: In addition to the core map and reduce functions, a combiner function, a partition function and an input reader can be provided by the user.
- Easy to integrate to larger applications using the standard Disco module and the Web APIs.
- Comes with a built-in distributed storage system (Disco Distributed Filesystem).
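The combiner and partition functions mentioned in the list deserve a small illustration. The sketch below shows the general idea only: the function names and signatures are my own and are not Disco's API. A partition function decides which reducer receives a key, and a combiner pre-aggregates map output so less data has to travel between nodes.

```python
import zlib
from collections import Counter


def partition(key, nr_partitions):
    # Stable hash, so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % nr_partitions


def combine(pairs):
    # Pre-aggregate (word, count) pairs before they leave the map side.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def shuffle(pairs, nr_partitions):
    # Route every pair to the bucket chosen by the partition function.
    buckets = [[] for _ in range(nr_partitions)]
    for key, value in pairs:
        buckets[partition(key, nr_partitions)].append((key, value))
    return buckets
```

Because all pairs with the same key end up in the same bucket, each reducer can sum its own bucket independently of the others.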
Five minutes to intrigue you. I hope it worked.
The other 5 min Lightning Talks
The other talks covered a wide range of topics:
- (Really) naive data mining, Joël Cox
- "Requests" library for easy json api access + testing dikes, Reinout van Rees
- Python for those little throwaway scripts (that you end up not throwing away), Tikitu de Jager
- Shell pearls (not to be confused with Perl shells), Remco Wendt
- A script for running shell commands from the OS X command line but executed in Virtual Machines, Reinout van Rees
All in all an evening well spent!