Parallel Map

Parallelism means doing multiple things at once. Writing your programs in a parallel way can make them faster in two scenarios:

  1. when your computer has multiple CPUs (or multiple cores, which are like multiple CPUs on a single chip)
  2. when some of the things being done in parallel involve waiting (for example, for a file to download or for time to pass)

There are many ways to write parallel code. We'll use Python's multiprocessing pools, which start multiple Python processes, then split work across them.

First, we'll create a function that is artificially slow.

In [1]:
from time import sleep, time

def add(xy):
    sleep(0.1) # you could imagine this is some complicated, slow calculation
    return xy[0] + xy[1]

t0 = time()
print("result:", add((2,3)))
t1 = time()
print(t1 - t0, "seconds")
result: 5
0.10309100151062012 seconds

If we add 10 pairs of numbers sequentially (meaning one at a time -- the opposite of in parallel), it will take 10 times that long (~1 second total). A loop is one way to perform this task sequentially.

In [2]:
xy_pairs = [(10,1), (10,2), (10,3), (10,4), (10,5), (3,3), (2,2), (1,1), (1,3), (4,3)]

t0 = time()
for xy in xy_pairs:
    print("result:", add(xy))
t1 = time()
print(t1 - t0, "seconds")
result: 11
result: 12
result: 13
result: 14
result: 15
result: 6
result: 4
result: 2
result: 4
result: 7
1.032318115234375 seconds

Python's map function calls a function for us on each input, yielding each result, but it still works sequentially.

In [3]:
t0 = time()
for result in map(add, xy_pairs):
    print("result:", result)
t1 = time()
print(t1 - t0, "seconds")
result: 11
result: 12
result: 13
result: 14
result: 15
result: 6
result: 4
result: 2
result: 4
result: 7
1.0275828838348389 seconds

Remember that map works with function references, doing something like this internally:

In [4]:
def mymap(func_ref, inputs):
    for v in inputs:
        result = func_ref(v)
        yield result
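
To convince ourselves that mymap behaves like the built-in map, we could call it the same way we called map above (a quick check that isn't a numbered cell in the original notebook). Since each call to add still sleeps 0.1 seconds and runs one at a time, this should again take about 1 second:

t0 = time()
for result in mymap(add, xy_pairs):  # same sequential behavior as map
    print("result:", result)
t1 = time()
print(t1 - t0, "seconds")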

Python's multiprocessing module has a Pool class. A pool is a collection of processes. We can create a pool with a specified number of processes (say 5), then use the pool's version of map, which IS parallel (each process does a subset of the work).

In [5]:
from multiprocessing import Pool

with Pool(5) as p:
    t0 = time()
    for result in p.map(add, xy_pairs):
        print("result:", result)
    t1 = time()
    print(t1 - t0, "seconds")
result: 11
result: 12
result: 13
result: 14
result: 15
result: 6
result: 4
result: 2
result: 4
result: 7
0.20520401000976562 seconds

It looks like using 5 processes made it 5 times faster! Let's try 10 processes.

In [6]:
with Pool(10) as p:
    t0 = time()
    for result in p.map(add, xy_pairs):
        print("result:", result)
    t1 = time()
    print(t1 - t0, "seconds")
result: 11
result: 12
result: 13
result: 14
result: 15
result: 6
result: 4
result: 2
result: 4
result: 7
0.10577178001403809 seconds

We're getting good speedups as we add more processes because most of the "work" is sleeping, and we can do plenty of that with limited cores. If we need to do actual computation (for example, looping 3 million times), the number of cores starts to matter:

In [7]:
def add(xy):
    for i in range(3000000): # 3 million
        pass
    return xy[0] + xy[1]

with Pool(1) as p:
    t0 = time()
    for result in p.map(add, xy_pairs):
        print("result:", result)
    t1 = time()
    print(t1 - t0, "seconds (1 process)")
    
with Pool(10) as p:
    t0 = time()
    for result in p.map(add, xy_pairs):
        print("result:", result)
    t1 = time()
    print(t1 - t0, "seconds (10 processes)")
result: 11
result: 12
result: 13
result: 14
result: 15
result: 6
result: 4
result: 2
result: 4
result: 7
0.6721689701080322 seconds (1 process)
result: 11
result: 12
result: 13
result: 14
result: 15
result: 6
result: 4
result: 2
result: 4
result: 7
0.16287779808044434 seconds (10 processes)

The difference between the speed with 1 process vs. 10 processes will depend on the number of cores. If there is only 1 core, the times will probably be about the same. If there are 10 or more cores, the second measurement will probably be close to ten times faster. In most cases, it will fall somewhere in between.

Based on the two times above, can you estimate how many cores your instructor's laptop has?
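
One rough way to work the estimate (a sketch, not part of the original exercise): divide the two times to get the speedup, and compare against what multiprocessing's cpu_count reports on your own machine:

from multiprocessing import cpu_count

speedup = 0.672 / 0.163  # the two times measured above
print("observed speedup:", speedup)  # ~4.1; a speedup near N suggests about N cores kept busy
print("cores on this machine:", cpu_count())  # reflects YOUR machine, not the instructor's laptop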

Bugs

Using multiple processes for parallelism is tricky because each process has its own variables/state. The pool takes care of passing in inputs and returning outputs, but avoid modifying global variables. They'll get modified in the pool processes, but not in the process running your Jupyter notebook (which is where people making this mistake usually expect to see the change).

In [8]:
total = 0

def increment(amt):
    global total
    total += amt
    print("SUB TOTAL SO FAR", total)

with Pool(2) as p:
    p.map(increment, [3,2,9,1])
SUB TOTAL SO FAR 3
SUB TOTAL SO FAR 2
SUB TOTAL SO FAR 12
SUB TOTAL SO FAR 3

There are two processes, each with its own total global variable, so even though they do some counting, they never see the total of 15. Worse, the process that made the map call (this Jupyter notebook) has yet another total global variable, which didn't get modified at all.

In [9]:
total
Out[9]:
0
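
The fix is to rely on the pool's handling of inputs and outputs instead of shared state: have the function return a value, then aggregate back in the notebook process. A minimal sketch (not a cell from the original notebook):

from multiprocessing import Pool

def get_amount(amt):
    return amt  # do any per-item work here, then return the result

with Pool(2) as p:
    amounts = p.map(get_amount, [3, 2, 9, 1])  # outputs come back to this process

total = sum(amounts)  # aggregate here, in the notebook process
print(total)  # 15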