# Python for HPC

## Agenda

- Introduction
- Numpy and Friends
- Bindings: plug in compiled code
- Compilers: compile Python
- Runtime compilers: on-the-fly compiling
- Parallel processing: multiprocessing, MPI
- Conclusions

---

# Introduction

---

## Sieve of Eratosthenes

![Sieve of Eratosthenes animation](Sieve_of_Eratosthenes_animation.gif)

[Wikipedia](http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes)

---

### sieve.py

```python
def primes(limit):
    sieve = [True]*limit
    for n in xrange(2, int(limit**0.5 + 1)):
        if sieve[n]:
            i = n**2
            while i < limit:
                sieve[i] = False
                i += n
    return [pos for pos, v in enumerate(sieve) if v]
```

---

### IPython

    In [8]: import this
    The Zen of Python, by Tim Peters

    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.

---

### IPython: timeit

    In [1]: from sieve import primes

    In [2]: %timeit primes(100000)
    10 loops, best of 3: 32.2 ms per loop

---

### Profiling

    In [5]: %load_ext line_profiler

    In [6]: %lprun -f primes primes(100000)
    Total time: 0.750672 s

    % Time  Line Contents
    ======  =============
            def primes(limit):
       0.2      sieve = [True]*limit
       0.0      for n in xrange(2, int(limit**0.5 + 1)):
       0.0          if sieve[n]:
       0.0              i = n**2
      26.4              while i < limit:
      29.7                  sieve[i] = False
      29.1                  i += n
      14.5      return [pos for pos, v in enumerate(sieve) if v]

http://pynash.org/2013/03/06/timing-and-profiling.html

---

# Numpy and Friends

---

## Numpy

- Provides multidimensional arrays and methods
- Runs almost as quickly as C
- Comparable to MATLAB in functionality

---

### Numpy: Dot Product

```python
>>> import numpy
>>> A = numpy.array([[1, 2, 3], [4, 5, 6]], dtype=numpy.float64)
>>> B = numpy.array([[4, 5], [6, 7], [8, 9]], dtype=numpy.float64)
>>> C = A.dot(B)
>>> print C
[[  40.   46.]
 [  94.  109.]]
```

---

## Numpy features

- Mathematical functions (trigonometry, complex numbers, etc.)
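As a quick illustration of that first point (a small sketch, not from the original deck), these mathematical functions are applied elementwise to whole arrays:

```python
import numpy as np

# One call operates on the entire array, with no explicit Python loop.
x = np.array([0.0, np.pi / 2, np.pi])
s = np.sin(x)                # approximately [0., 1., 0.]

# Complex numbers are supported too: |e^{ix}| == 1 elementwise.
m = np.abs(np.exp(1j * x))
```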
- Linear Algebra
- FFT
- Statistics
- Polynomials (evaluation, fitting, and manipulation)
- Financial functions

---

## Numpy ecosystem: Scipy

- *cluster*: Hierarchical clustering, vector quantization, K-means
- *fftpack*: Discrete Fourier Transform algorithms
- *integrate*: Numerical integration routines
- *interpolate*: Interpolation tools
- *linalg*: Linear algebra routines
- *optimize*: Optimization algorithms including linear programming
- *signal*: Signal processing tools
- *sparse*: Sparse matrix and related algorithms
- *spatial*: KD-trees, nearest neighbors, distance functions
- *special*: Special functions
- *stats*: Statistical functions

---

## Numpy Sieve

```python
def primes(limit):
    sieve = numpy.ones(limit, dtype=numpy.bool)
    for n in xrange(2, int(limit**0.5 + 1)):
        if sieve[n]:
            sieve[n*n::n] = False
    return numpy.nonzero(sieve)[0]
```

[Source](http://rosettacode.org/wiki/Sieve_of_Eratosthenes#Using_numpy)

---

### Timing

    In [2]: from sieve_numpy import primes

    In [3]: %timeit primes(100000)
    1000 loops, best of 3: 774 µs per loop

---

### Time Profiling

    In [10]: from sieve_numpy import primes

    In [11]: %lprun -f primes primes(100000)
    Total time: 0.004034 s

    % Time  Line Contents
    ======  =============
            def primes(limit):
      18.3      sieve = numpy.ones(limit + 1, dtype=numpy.bool)
      14.3      for n in xrange(2, int(limit**0.5 + 1)):
      15.9          if sieve[n]:
      19.3              sieve[n*n::n] = False
      32.2      return numpy.nonzero(sieve)[0]

---

### Memory Profiling

    In [13]: %mprun -f primes primes(10000000)
    Filename: sieve_numpy.py

    Mem usage  Increment  Line Contents
    =========  =========  =============
     30.4 MiB    0.0 MiB  def primes(limit):
     40.0 MiB    9.5 MiB      sieve = numpy.ones(limit + 1, dtype=numpy.bool)
     40.0 MiB    0.0 MiB      for n in xrange(2, int(limit**0.5 + 1)):
     40.0 MiB    0.0 MiB          if sieve[n]:
     40.0 MiB    0.0 MiB              sieve[n*n::n] = False
     45.0 MiB    5.1 MiB      return numpy.nonzero(sieve)[0]

---

## Numpy ecosystem: Pandas

- Similar to R: a Python utility belt for data analysts
- Based on `DataFrame`, a flexible and indexable data structure

[Pandas highlights](http://pandas.pydata.org/index.html#library-highlights)

---

### Pandas & Matplotlib

```python
In [7]: df = pd.DataFrame(numpy.random.randn(1000, 4),
                          index=pd.date_range('20000101', periods=1000),
                          columns=['A', 'B', 'C', 'D'])

In [8]: df.head()
Out[8]:
                   A         B         C         D
2000-01-01  0.764754 -0.362058 -0.968327  0.661106
2000-01-02 -0.128322 -0.643062 -0.758038  0.904152
2000-01-03  0.044131 -0.793913 -0.825394 -0.868571
2000-01-04 -0.800102 -2.252049 -0.325173 -0.125491
2000-01-05 -1.963810 -1.859499  1.331451 -0.479015

In [9]: df = df.cumsum()

In [10]: df.plot()

In [11]: plt.legend(loc='best')
```

Must be run with `ipython --pylab`

---

### Result

![DataFrame](dataframe.png)

---

### Quick tour

    In [68]: import pandas as pd
    In [69]: import numpy as np
    In [70]: dates = pd.date_range('20130101', periods=6)
    In [71]: df = pd.DataFrame(np.random.randn(6, 4), index=dates,
                               columns=list('ABCD'))
    In [72]: df.describe()
    Out[72]:
                  A         B         C         D
    count  6.000000  6.000000  6.000000  6.000000
    mean   0.116415 -0.213404  0.600224  0.057387
    std    1.434106  0.678741  1.053339  1.094433
    min   -2.460292 -1.344600 -1.118852 -1.218678
    25%   -0.318416 -0.345801  0.067940 -0.592758
    50%    0.897299 -0.218745  0.994070 -0.105262
    75%    0.930749  0.080316  1.194160  0.444741
    max    1.145937  0.709814  1.711972  1.900707

---

### Quick tour

    In [73]: df.sort('A')
    Out[73]:
                       A         B         C         D
    2013-01-02 -2.460292 -1.344600 -0.193278  0.527732
    2013-01-01 -0.716522 -0.165612  1.213364 -0.406292
    2013-01-03  0.875902 -0.271879 -1.118852 -1.218678
    2013-01-04  0.918697 -0.370441  0.851592  0.195767
    2013-01-05  0.934766  0.162292  1.136547  1.900707
    2013-01-06  1.145937  0.709814  1.711972 -0.654913

    In [74]: df.T
    Out[74]:
       2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
    A   -0.716522   -2.460292    0.875902    0.918697    0.934766    1.145937
    B   -0.165612   -1.344600   -0.271879   -0.370441    0.162292    0.709814
    C    1.213364   -0.193278   -1.118852    0.851592    1.136547    1.711972
    D   -0.406292    0.527732   -1.218678    0.195767    1.900707   -0.654913

---

### Quick tour

    In [75]: df[df.A > 0]
    Out[75]:
                       A         B         C         D
    2013-01-03  0.875902 -0.271879 -1.118852 -1.218678
    2013-01-04  0.918697 -0.370441  0.851592  0.195767
    2013-01-05  0.934766  0.162292  1.136547  1.900707
    2013-01-06  1.145937  0.709814  1.711972 -0.654913

---

## Numpy ecosystem: Others

- http://scikit-learn.org: Machine Learning in Python
- http://statsmodels.sourceforge.net/: explore data, estimate statistical models, and perform statistical tests
- PyTables (built on top of Numpy & Cython)
- Many other libraries that "support" numpy, like Mpi4py, which serialises numpy arrays implicitly

---

# Bindings

---

## F2py

- Fortran to Python interface generator
- Support for Numpy arrays

---

### F2py: Example

Some Fortran code:

```fortran
C File hello.f
      subroutine foo (a)
      integer a
      print*, "Hello from Fortran!"
      print*, "a=", a
      end
```

---

### F2py: Example

Let's compile this:

    f2py -c -m hello hello.f

And call it:

```python
>>> import hello
>>> print hello.foo.__doc__
foo - Function signature:
  foo(a)
Required arguments:
  a : input int

>>> hello.foo(10)
Hello from Fortran!
a= 10
```

---

### F2py: Example

All-in-one:

```python
import numpy.f2py as f2py
fid = open('hello.f')
source = fid.read()
fid.close()
f2py.compile(source, modulename='hello')
import hello
```

Source: [Scipy documentation](http://docs.scipy.org/doc/numpy/user/c-info.python-as-glue.html)

---

## Weave

```python
from scipy import weave

code = r"""
int i;
py::tuple results(2);
for (i=0; i<a.length(); i++) {
    a[i] = i;
}
results[0] = 3.0;
results[1] = 4.0;
return_val = results;
"""
a = [None]*10
res = weave.inline(code, ['a'])
```

---

## ctypes

```python
from ctypes import cdll, Structure, c_int, c_double, c_uint

lib = cdll.LoadLibrary('./libsomelib.so')

# Describe the DataPoint structure to ctypes.
class DataPoint(Structure):
    _fields_ = [('num', c_int), ('dnum', c_double)]

# Initialize the DataPoint[4] argument
dps = (DataPoint * 4)((2, 2.2), (3, 3.3), (4, 4.4), (5, 5.5))

# Grab add_data and specify its return type.
add_data_fn = lib.add_data
add_data_fn.restype = DataPoint

dout = add_data_fn(dps, 4)
```

[Source](http://eli.thegreenplace.net/2013/03/09/python-ffi-with-ctypes-and-cffi/)

---

## Some library

```c
#ifndef SOMELIB_H
#define SOMELIB_H

typedef struct {
    int num;
    double dnum;
} DataPoint;

DataPoint add_data(const DataPoint* dps, unsigned n);

#endif /* SOMELIB_H */
```

Compiled with:

    gcc -std=c99 -shared -fPIC -o libsomelib.so somelib.c

---

## cffi

```python
from cffi import FFI

ffi = FFI()
lib = ffi.dlopen('./libsomelib.so')

# Describe the data type and function prototype to cffi.
ffi.cdef('''
typedef struct {
    int num;
    double dnum;
} DataPoint;

DataPoint add_data(const DataPoint* dps, unsigned n);
''')

# Create an array of DataPoint structs and initialize it.
dps = ffi.new('DataPoint[]', [(2, 2.2), (3, 3.3), (4, 4.4), (5, 5.5)])

dout = lib.add_data(dps, 4)
```

---

# Compilers

---

## Cython

- Speeds up Python code
- Better speed-up with annotations
- Can also do C bindings

---

### sieve_cython.pyx

```python
def primes(int limit):
    cdef int n, i, pos
    sieve = [True]*limit
    for n in xrange(2, int(limit**0.5 + 1)):
        if sieve[n]:
            i = n**2
            while i < limit:
                sieve[i] = False
                i += n
    return [pos for pos, v in enumerate(sieve) if v]
```

---

### setup.py

```python
from distutils.core import setup
from Cython.Build import cythonize

setup(
    ext_modules = cythonize("sieve_cython.pyx")
)
```

Run:

    python setup.py install --install-lib=.

---

### Timeit

    In [1]: from sieve_cython import primes

    In [2]: %timeit primes(100000)
    100 loops, best of 3: 2.41 ms per loop

---

## Pypy

- Just-in-Time compiler
- Faster than CPython
- Sometimes less memory hungry
- Sandboxing
- Stackless
- STM (Software Transactional Memory)?
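Because the JIT runs unmodified Python, the same sieve module can be timed under either interpreter. Outside IPython, the stdlib `timeit` module does what the `%timeit` magic does (a sketch in Python 3 syntax; note it also clears `sieve[0]` and `sieve[1]`, which the slide version leaves marked prime):

```python
import timeit

def primes(limit):
    # Pure-Python sieve, as in sieve.py (range instead of xrange on Python 3).
    sieve = [True] * limit
    sieve[0] = sieve[1] = False  # 0 and 1 are not prime
    for n in range(2, int(limit**0.5 + 1)):
        if sieve[n]:
            i = n * n
            while i < limit:
                sieve[i] = False
                i += n
    return [pos for pos, v in enumerate(sieve) if v]

# timeit also accepts a callable: total seconds for 10 calls.
elapsed = timeit.timeit(lambda: primes(100000), number=10)
print('%.2f ms per call' % (elapsed / 10 * 1000))
```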
- Cffi included

---

### Pypy: Timing

    $ pypy -m timeit "import sieve" "sieve.primes(100000)"
    10 loops, best of 3: 4.98 msec per loop

---

## Pypy: Numpypy

- Re-implementation of Numpy in Pypy
- Already faster than numpy ([2012 benchmark](http://morepypy.blogspot.be/2012/01/numpypy-progress-report-running.html))
- Incomplete implementation

---

## Shedskin

- Python to C++ compiler
- No Numpy integration
- Does type inference on the code

---

### Compilation & Run

    $ shedskin -e sieve   # -> creates sieve.cpp, Makefile
    $ make                # -> creates sieve.so
    $ python -m timeit "import sieve" "sieve.primes(100000)"
    1000 loops, best of 3: 986 usec per loop

---

## Others

- [Nuitka](http://nuitka.net)
- [Pythran](http://pythonhosted.org/pythran/)

---

# Runtime Compilers

---

## Copperhead

- Project supported by NVidia
- "targets NVIDIA GPUs, as well as multicore CPUs through OpenMP and Threading Building Blocks (TBB)."

---

### Copperhead Example

```python
from copperhead import *
import numpy as np

@cu
def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

x = np.arange(100, dtype=np.float64)
y = np.arange(100, dtype=np.float64)

with places.gpu0:
    gpu = axpy(2.0, x, y)

with places.openmp:
    cpu = axpy(2.0, x, y)
```

---

## Numba

- "Numpy-aware"
- Targets LLVM (and Cuda with NumbaPro)
- Supported by [Continuum](http://continuum.io)

---

### Numba Example

```python
from numba import autojit

@autojit
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result
```

---

## Parakeet

- Supported backends: C (gcc), OpenMP, Cuda
- Website: http://www.parakeetpython.com/

---

### Parakeet: Example

```python
import numpy as np
from parakeet import jit

alpha, beta = 0.5, 0.3
x = np.array([1, 2, 3])
y = np.tanh(x * alpha) + beta

@jit
def fast(x, alpha=0.5, beta=0.3):
    return np.tanh(x * alpha) + beta

@jit
def loopy(x, alpha=0.5, beta=0.3):
    y = np.empty_like(x, dtype=float)
    for i in xrange(len(x)):
        y[i] = np.tanh(x[i] * alpha) + beta
    return y
```

---

## Theano

- Developed mainly for machine learning
- Integration with Numpy
- Transparent use of GPU
- See also: [Cudamat](http://code.google.com/p/cudamat/) & [gnumpy](http://www.cs.toronto.edu/~tijmen/gnumpy.html)

[Theano documentation](http://deeplearning.net/software/theano/)

---

### Theano: Example

```python
import theano
from theano import tensor

# declare two symbolic floating-point scalars
a = tensor.dscalar()
b = tensor.dscalar()

# create a simple expression
c = a + b

# convert the expression into a callable object that
# takes (a,b) values as input and computes a value for c
f = theano.function([a, b], c)

# bind 1.5 to 'a', 2.5 to 'b', and evaluate 'c'
assert 4.0 == f(1.5, 2.5)
```

---

# Parallel processing

---

## Multiprocessing

- Goal: share the load across CPUs on the same computer
- And avoid the GIL

---

### Built-in Map vs. Multiprocessing map

```python
>>> l = [1, 2, 3]
>>> def my_fun(x):
...     return x*2
...
>>> map(my_fun, l)
[2, 4, 6]
```

Parallel:

```python
from multiprocessing import Pool
from time import sleep

def work(x):
    sleep(2)  # sleep 2 seconds
    return x

pool = Pool()
print pool.map(work, range(10))
```

---

### Multiprocessing Package

Other features:

- Offers local and remote concurrency.
- Pipes & Queues (message passing).
- Locks, Events, Shared Objects, Manager, etc. (shared state).

---

## Joblib

- https://pythonhosted.org/joblib/

```python
>>> from joblib import Parallel, delayed
>>> from math import sqrt
>>> Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```

---

## MPI

- MPI is a message passing interface
- Provides an inter-process communication library.
- Exposes logical topology and synchronisation.
- `mpirun` / `mpiexec` takes care of running your executable.
- Can be plugged into IPython

---

## Mpi4py

- Object-oriented library, with an API that follows the MPI-2 C++ bindings
- Almost as fast as compiled C
- Python goodness

---

## Install & running

    $ module load openmpi/1.5.3/gcc-4.4.4
    $ easy_install -d .local mpi4py
    $ mpirun -np 3 python mpi_example_sleep.py
    ^Z
    [1]+  Stopped    mpirun -np 3 python mpi_example_sleep.py
    $ ps
      PID TTY          TIME CMD
    19115 pts/6    00:00:00 mpirun
    19116 pts/6    00:00:00 python
    19117 pts/6    00:00:00 python
    19118 pts/6    00:00:00 python
    19120 pts/6    00:00:00 ps
    58814 pts/6    00:00:00 bash
    $ fg
    mpirun -np 3 python mpi_example_sleep.py
    $

---

## Hello world

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg = comm.bcast('hello')
print msg, 'world from rank', rank
```

---

## Broadcast result

    $ mpirun -np 10 python mpi_example_broadcast.py
    hello world from rank 0
    hello world from rank 2
    hello world from rank 4
    hello world from rank 6
    hello world from rank 8
    hello world from rank 1
    hello world from rank 3
    hello world from rank 5
    hello world from rank 7
    hello world from rank 9

---

## Send and Receive

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {'a': 7, 'b': 3.14}
    comm.send(data, dest=1)
    print 'SEND', data
else:
    data = comm.recv(source=0)
    print 'RECV', data
```

---

## Send and Receive: The fast way

```python
from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = numpy.arange(10, dtype='i')
    comm.Send([data, MPI.INT], dest=1)
elif rank == 1:
    data = numpy.empty(10, dtype='i')
    comm.Recv([data, MPI.INT], source=0)
    print data
```

Faster because Numpy arrays can be sent as is, with no serialisation.

---

## Pipeline

```python
from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

if comm.rank == 0:
    v = numpy.array([rank]*5, dtype=float)
    print "send", v
    comm.send(v, dest=(rank+1)%size)
    data = comm.recv(source=size-1)
    print "get", data
else:
    data = comm.recv(source=(rank-1)%size)
    print rank, 'forwards', data
    comm.send(data+1, dest=(rank+1)%size)
```

---

## Pipeline: Result

    $ mpirun -np 2 python mpi_example_pipeline.py
    send [ 0.  0.  0.  0.  0.]
    1 forwards [ 0.  0.  0.  0.  0.]
    get [ 1.  1.  1.  1.  1.]

    $ mpirun -np 10 python mpi_example_pipeline.py
    send [ 0.  0.  0.  0.  0.]
    1 forwards [ 0.  0.  0.  0.  0.]
    2 forwards [ 1.  1.  1.  1.  1.]
    3 forwards [ 2.  2.  2.  2.  2.]
    4 forwards [ 3.  3.  3.  3.  3.]
    5 forwards [ 4.  4.  4.  4.  4.]
    6 forwards [ 5.  5.  5.  5.  5.]
    7 forwards [ 6.  6.  6.  6.  6.]
    8 forwards [ 7.  7.  7.  7.  7.]
    9 forwards [ 8.  8.  8.  8.  8.]
    get [ 9.  9.  9.  9.  9.]

---

## Python reduce function

```python
>>> sum = lambda x, y: x + y
>>> sum(2, 3)
5
>>> reduce(lambda x, y: x + y, range(10))
45
```

---

## The same with MPI

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

total = comm.reduce(rank, op=MPI.SUM)
print total
```

---

# Python and Slurm

---

### Python submission script

```python
#!/bin/env python
#SBATCH --job-name=multiprocess
#SBATCH --ntasks=1
#SBATCH ...
import multiprocessing, sys, os
from ... import work

# necessary to add cwd to path when the script is run
# by slurm (since it executes a copy)
sys.path.append(os.getcwd())

try:
    ncpus = int(os.environ["SLURM_JOB_CPUS_PER_NODE"])
except KeyError:
    ncpus = multiprocessing.cpu_count()

p = multiprocessing.Pool(ncpus)
p.map(work, range(100))
```

---

# Summary

- Numpy and Friends
- Bindings: plug in compiled code
- Compilers: compile Python
- Runtime compilers: on-the-fly compiling
- Parallel processing: multiprocessing, MPI