{
"metadata": {
"name": "",
"signature": "sha256:1dfc0e2273fbb74ffe4b603f8778af1f5d3b35d43629a26823bacc2aa0115250"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parallel Programming with numpy and scipy\n",
"\n",
"TAGS: Performance\n",
"\n",
"Multiprocessor and multicore machines are becoming more common, and it would\n",
"be nice to take advantage of them to make your code run faster. numpy/scipy\n",
"are not perfect in this area, but there are some things you can do. The best\n",
"way to make use of a parallel processing system depend on the task you're\n",
"doing and on the parallel system you're using. If you have a big complicated\n",
"job or a cluster of machines, taking full advantage will require much thought.\n",
"But many tasks can be parallelized in a fairly simple way.\n",
"\n",
"As the saying goes, \"[premature optimization is the root of all evil](\n",
"http://www.acm.org/ubiquity/views/v7i24_fallacy.html)\". Using a\n",
"multicore machine will provide at best a speedup by a factor of the number of\n",
"cores available. Get your code working first, before even thinking about\n",
"parallelization. Then ask yourself whether your code actually needs to be any\n",
"faster. Don't embark on the bug-strewn path of parallelization unless you have\n",
"to.\n",
"\n",
"## Simple parallelization\n",
"\n",
"### Break your job into smaller jobs and run them simultaneously\n",
"\n",
"For example, if you are analyzing data from a pulsar survey, and you have\n",
"thousands of beams to analyze, each taking a day, the simplest (and probably\n",
"most efficient) way to parallelize the task is to simply run each beam as a\n",
"job. Machines with two processors can just run two jobs. No need to worry\n",
"about locking or communication; no need to write code that knows it's running\n",
"in parallel. You can have issues if the processes each need as much memory as\n",
"your machine has, or if they are both IO heavy, but in general this is a\n",
"simple and efficient way to parallelize your code - if it works. Not all tasks\n",
"divide up this nicely. If your goal is to process a single image, it's not\n",
"clear how to do this without a lot of work.\n",
"\n",
"### Use parallel primitives\n",
"\n",
"One of the great strengths of numpy is that you can express array operations\n",
"very cleanly. For example to compute the product of the matrix A and the\n",
"matrix B, you just do:\n",
"\n",
" \n",
" \n",
" >>> C = numpy.dot(A,B)\n",
"\n",
"Not only is this simple and clear to read and write, since numpy knows you\n",
"want to do a matrix dot product it can use an optimized implementation\n",
"obtained as part of \"BLAS\" (the Basic Linear Algebra Subroutines). This will\n",
"normally be a library carefully tuned to run as fast as possible on your\n",
"hardware by taking advantage of cache memory and assembler implementation. But\n",
"many architectures now have a BLAS that also takes advantage of a multicore\n",
"machine. If your numpy/scipy is compiled using one of these, then dot() will\n",
"be computed in parallel (if this is faster) without you doing anything.\n",
"Similarly for other matrix operations, like inversion, singular value\n",
"decomposition, determinant, and so on. For example, the open source library\n",
"[ATLAS](http://math-atlas.sourceforge.net/) allows\n",
"compile time selection of the level of parallelism (number of threads). The\n",
"proprietary\n",
"[MKL](http://www.intel.com/cd/software/products/asmo-\n",
"na/eng/307757.htm) library from Intel offers the possibility to chose the\n",
"level of parallelism at runtime. There is also the [GOTO](\n",
"http://www.tacc.utexas.edu/resources/software/software_downloads.php)\n",
"library that allow run-time selection of the level of parallelism. This is a\n",
"commercial product but the source code is distributed free for academic use.\n",
"\n",
"Finally, scipy/numpy does not parallelize operations like\n",
"\n",
" \n",
" \n",
" >>> A = B + C\n",
" >>> A = numpy.sin(B)\n",
" >>> A = scipy.stats.norm.isf(B)\n",
"\n",
"These operations run sequentially, taking no advantage of multicore machines\n",
"(but see below). In principle, this could be changed without too much work.\n",
"OpenMP is an extension to the C language which allows compilers to produce\n",
"parallelizing code for appropriately-annotated loops (and other things). If\n",
"someone sat down and annotated a few core loops in numpy (and possibly in\n",
"scipy), and if one then compiled numpy/scipy with OpenMP turned on, all three\n",
"of the above would automatically be run in parallel. Of course, in reality one\n",
"would want to have some runtime control - for example, one might want to turn\n",
"off automatic parallelization if one were planning to run several jobs on the\n",
"same multiprocessor machine.\n",
"\n",
"### Write multithreaded or multiprocess code\n",
"\n",
"Sometimes you can see how to break your problem into several parallel tasks,\n",
"but those tasks need some kind of synchronization or communication. For\n",
"example, you might have a list of jobs that can be run in parallel, but you\n",
"need to gather all their results, do some summary calculation, then launch\n",
"another batch of parallel jobs. There are two approaches to doing this in\n",
"Python, using either multiple [threads](http://en.wikipe\n",
"dia.org/wiki/Thread_\\(computer_science\\)) or [processes](\n",
"http://en.wikipedia.org/wiki/Process_\\(computing\\)).\n",
"\n",
"#### Threads\n",
"\n",
"Threads are generally 'lighter' than processes, and can be created, destroyed\n",
"and switched between faster than processes. They are normally preferred for\n",
"taking advantage of multicore systems. However, multithreading with python has\n",
"a key limitation; the [Global Interpreter\n",
"Lock](http://docs.python.org/api/threads.html)\n",
"([GIL](http://effbot.org/pyfaq/what-is-the-global-\n",
"interpreter-lock.htm)). For various reasons (a quick web search will turn up\n",
"copious [discussion](http://blog.ianbicking.org/gil-of-\n",
"doom.html), not to say [argument](http://mail.python.org\n",
"/pipermail/python-3000/2007-May/007414.html), over [why this exists](\n",
"http://www.artima.com/weblogs/viewpost.jsp?thread=214235) and\n",
"[whether it's a good\n",
"idea](http://blog.snaplogic.org/?p=94)), python is\n",
"implemented in such a way that only one thread can be accessing the\n",
"interpreter at a time. This basically means only one thread can be running\n",
"python code at a time. This almost means that you don't take any advantage of\n",
"parallel processing at all. The exceptions are few but important: while a\n",
"thread is waiting for IO (for you to type something, say, or for something to\n",
"come in the network) python releases the GIL so other threads can run. And,\n",
"more importantly for us, while numpy is doing an array operation, python also\n",
"releases the GIL. Thus if you tell one thread to do:\n",
"\n",
" \n",
" \n",
" >>> print \"%s %s %s %s and %s\" %( (\"spam\",) *3 + (\"eggs\",) + (\"spam\",) )\n",
" >>> A = B + C\n",
" >>> print A\n",
"\n",
"During the print operations and the % formatting operation, no other thread\n",
"can execute. But during the A = B + C, another thread can run - and if you've\n",
"written your code in a numpy style, much of the calculation will be done in a\n",
"few array operations like A = B + C. Thus you can actually get a speedup from\n",
"using multiple threads.\n",
"\n",
"The python threading module is part of the standard library and provides tools\n",
"for multithreading. Given the limitations discussed above, it may not be worth\n",
"carefully rewriting your code in a multithreaded architecture. But sometimes\n",
"you can do multithreading with little effort, and in these cases it can be\n",
"worth it. See [Cookbook/Multithreading](Cookbook/Multithreading.html) for one\n",
"example. [This\n",
"recipe](http://code.activestate.com/recipes/576519/)\n",
"provides a thread Pool() interface with the same API as that found for\n",
"processes (below) which might also be of interest. There is also the [ThreadPo\n",
"ol](http://www.chrisarndt.de/projects/threadpool/)\n",
"module which is quite similar.\n",
"\n",
"#### Processes\n",
"\n",
"One way to overcome the limitations of the GIL discussed above is to use\n",
"multiple full processes instead of threads. Each process has it's own GIL, so\n",
"they do not block each other in the same way that threads do. From python 2.6,\n",
"the standard library includes a [multiprocessing](http:/\n",
"/docs.python.org/library/multiprocessing.html) module, with the same interface\n",
"as the threading module. For earlier versions of Python, this is available as\n",
"the [processing](http://pyprocessing.berlios.de/) module\n",
"(a backport of the multiprocessing module of python 2.6 for python 2.4 and 2.5\n",
"is in the works here:\n",
"[multiprocessing](http://code.google.com/p/python-\n",
"multiprocessing) ). It is possible to share memory between processes,\n",
"including [numpy arrays](http://coding.derkeiler.com/Arc\n",
"hive/Python/comp.lang.python/2008-09/msg00937.html). This allows most of the\n",
"benefits of threading without the problems of the GIL. It also provides a\n",
"simple Pool() interface, which features map and apply commands similar to\n",
"those found in the [Cookbook/Multithreading](Cookbook/Multithreading.html)\n",
"example.\n",
"\n",
"#### Comparison\n",
"\n",
"Here is a very basic comparison which illustrates the effect of the GIL (on a\n",
"dual core machine).\n",
"\n",
" \n",
" \n",
" import numpy as np\n",
" import math\n",
" def f(x):\n",
" print x\n",
" y = [1]*10000000\n",
" [math.exp(i) for i in y]\n",
" def g(x):\n",
" print x\n",
" y = np.ones(10000000)\n",
" np.exp(y)\n",
" \n",
" \n",
" from handythread import foreach\n",
" from processing import Pool\n",
" from timings import f,g\n",
" def fornorm(f,l):\n",
" for i in l:\n",
" f(i)\n",
" time fornorm(g,range(100))\n",
" time fornorm(f,range(10))\n",
" time foreach(g,range(100),threads=2)\n",
" time foreach(f,range(10),threads=2)\n",
" p = Pool(2)\n",
" time p.map(g,range(100))\n",
" time p.map(f,range(100))\n",
"\n",
"| 100 * g() | 10 * f() \n",
"---|---|--- \n",
"normal | `43.5s` | `48s` \n",
"2 threads | `31s` | `71.5s` \n",
"2 processes |`27s` |`31.23` \n",
" \n",
"For function `f()`, which does not release the GIL, threading actually\n",
"performs worse than serial code, presumably due to the overhead of context\n",
"switching. However, using 2 processes does provide a significant speedup. For\n",
"function `g()` which uses numpy and releases the GIL, both threads and\n",
"processes provide a significant speed up, although multiprocesses is slightly\n",
"faster.\n",
"\n",
"## Sophisticated parallelization\n",
"\n",
"If you need sophisticated parallelism - you have a computing cluster, say, and\n",
"your jobs need to communicate with each other frequently - you will need to\n",
"start thinking about real parallel programming. This is a subject for graduate\n",
"courses in computer science, and I'm not going to address it here. But there\n",
"are some python tools you can use to implement the things you learn in that\n",
"graduate course. (I am perhaps exaggerating - some parallelization is not that\n",
"difficult, and some of these tools make it fairly easy. But do realize that\n",
"parallel code is much more difficult to write and debug than serial code.)\n",
"\n",
" * [IPython1](http://ipython.scipy.org/moin/IPython1)\n",
" * [mpi4py](mpi4py.html)\n",
" * [parallel python](http://www.parallelpython.com/)\n",
" * [POSH](http://poshmodule.sourceforge.net/)\n"
]
}
],
"metadata": {}
}
]
}