I spent two days studying parallel computing using Python’s multiprocessing module.

My Debian box has a 6-core 64-bit AMD processor, and the Python version is 2.7.5.

The interesting thing I’ve found so far is that the multiprocessing module does make the calculation 3 to 4 times faster than the sequential version. But it has two big disadvantages that can drop the performance of parallel computing below the sequential one.

The first is that multiple worker processes eat up system memory really fast. Once the system starts to use swap memory, the performance is gone.
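To get a feel for why memory fills up so fast, here is a rough back-of-the-envelope estimate I added (the byte counts come from `sys.getsizeof` on my interpreter and vary by CPython build; it also ignores small-int caching, so it is only a ballpark):

```python
import sys

# size of one (r, g, b) tuple on this interpreter
# (64-bit CPython: roughly 64 bytes)
tuple_size = sys.getsizeof((10, 20, 30))

# the full table: 512 * 256 * 256 tuples,
# plus one 8-byte list slot pointing at each tuple
n_tuples = 512 * 256 * 256
estimate = n_tuples * (tuple_size + 8)

print(n_tuples)             # 33554432
print(estimate / 2.0 ** 30) # rough size in GiB
```

That is on the order of a couple of GiB for the parent’s final list alone, and each worker additionally holds its own chunk of the result before it is collected, so swapping is easy to hit.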

The second is that collecting the finished data from all the cores adds a lot to the total processing time.
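The collection step is expensive because a multiprocessing.Queue pickles every object on put() and unpickles it on get(), and a list of millions of tuples makes one huge pickle. A quick sketch of mine to measure just the serialization half of that cost (using a smaller stand-in list, not the full table):

```python
import pickle
import time

# a smaller stand-in for one worker's result: 4 * 256 * 256 tuples
part = [(r, g, b) for r in range(4) for g in range(256) for b in range(256)]

t0 = time.time()
blob = pickle.dumps(part, pickle.HIGHEST_PROTOCOL)   # what put() does
restored = pickle.loads(blob)                        # what get() does
print(time.time() - t0)
print(len(blob))
```

Scaling this round trip up to 33 million tuples, plus the pipe transfer between processes, is where the tens of seconds go.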

This method simply divides the data by the number of cores, so that each core gets an equal-sized chunk; when all cores finish their calculations, the results are collected into one big list through the multiprocessing.Queue class.
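The split itself is plain list slicing with a ceiling-division chunk size, so the last chunk may come up short but no element is lost. A minimal sketch of that slicing logic on its own:

```python
import math

def split(nums, nprocs):
    # ceiling division, same slicing as in mp_f2() below;
    # the last chunk may be shorter than the others
    chunksize = int(math.ceil(len(nums) / float(nprocs)))
    return [nums[chunksize * i:chunksize * (i + 1)] for i in range(nprocs)]

chunks = split(list(range(10)), 4)
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```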

f2() is a sequential function that generates (x, y, z) tuples and stores them in a list object.

```python
import time
import math
import multiprocessing as mp

def f2(A, B, C):
    rgb = []
    for a in A:
        for b in B:
            for c in C:
                rgb.append((a, b, c))
    return rgb

def mp_f2(nums, B, C, nprocs):
    def worker(As, B, C, out_q):
        print 'Starting:', mp.current_process().name
        t0 = time.time()
        out_q.put(f2(As, B, C))
        print 'Exiting:', mp.current_process().name, time.time() - t0

    out_q = mp.Queue()
    # split the first range into nprocs equal-sized chunks
    chunksize = int(math.ceil(len(nums) / float(nprocs)))
    procs = []
    for i in range(nprocs):
        p = mp.Process(
            target=worker,
            args=(nums[chunksize * i:chunksize * (i + 1)], B, C, out_q))
        procs.append(p)
        p.start()

    # drain the queue before joining; large results left on the
    # queue can otherwise deadlock the workers
    result_list = []
    t0 = time.time()
    for i in range(nprocs):
        result_list.extend(out_q.get())
    print 'Queue processing time: ', time.time() - t0

    # wait for all worker processes to finish
    for p in procs:
        p.join()

    return result_list
```

Here is the testing part. I’ve varied the number of cores from 1 to 12.

```python
R = range(0, 512)
G = range(0, 256)
B = range(0, 256)

t1 = time.time()
# rgb_table1 = f1(R, G, B)
t2 = time.time()
dt = t2 - t1
# print dt, len(rgb_table1)

t1 = time.time()
rgb_table2 = f2(R, G, B)
t2 = time.time()
dt2 = t2 - t1
print dt2, len(rgb_table2)

core_number = 12
t1 = time.time()
rgb_table3 = mp_f2(R, G, B, core_number)
dt3 = time.time() - t1
print dt3, len(rgb_table3)
```

Here is a sample output of the code above.

The function f2() takes 12.72 seconds to generate a list of 33,554,432 (512 × 256 × 256) tuples.

The function mp_f2(), which uses 12 cores in parallel, takes around 5 seconds on average per worker, but eats up over 50 seconds collecting the lists of tuples from the 12 cores:

Queue processing time: 54.9099760056

So the total calculation time of mp_f2() is 58 seconds, which is more than 4 times slower than the single-core run. 😯

```
12.7239580154 33554432
Starting: Process-1
Starting: Process-2
Starting: Process-3
Starting: Process-4
Starting: Process-5
Starting: Process-6
Starting: Process-7
Starting: Process-8
Starting: Process-9
Starting: Process-10
Starting: Process-11
Starting: Process-12
Exiting: Process-4 4.4180560112
Exiting: Process-3 4.68716597557
Exiting: Process-1 4.96371507645
Exiting: Process-2 4.9275701046
Exiting: Process-7 4.75080990791
Exiting: Process-6 5.19996786118
Exiting: Process-5 5.50654911995
Exiting: Process-8 5.22987914085
Exiting: Process-9 5.03168988228
Exiting: Process-10 4.85275483131
Exiting: Process-12 4.33220481873
Exiting: Process-11 4.67043995857
Queue processing time: 54.9099760056
58.0520889759 33554432
```

I will test a few other methods that use parallel computing.