Suppose I have an iterator that only works on the main thread (throws an exception otherwise), but I still want to distribute work (one task per item from the iterator) over several processes. (Because the cost of the work per item is much higher than the cost of the iteration.)
How can I modify the (toy) program below to distribute the work over several processes, without modifying Graph, GraphIterator, or get_number_from_graph, and using only standard Python libraries?
from multiprocessing import Pool
import threading

class Graph:
    def __init__(self, num_vertices):
        self._num_vertices = num_vertices

class GraphIterator:
    def __init__(self, num_graphs):
        self._num_graphs = num_graphs
        self._current_graph = 0

    def __iter__(self):
        return self

    def __next__(self):
        assert threading.current_thread() is threading.main_thread(), 'iterator only works on the main thread'
        if self._current_graph < self._num_graphs:
            self._current_graph += 1
            return Graph(self._current_graph)
        else:
            raise StopIteration

def get_number_from_graph(graph):
    return graph._num_vertices

if __name__ == '__main__':
    num_graphs = 100
    print('Sequential result:', sum(get_number_from_graph(g) for g in GraphIterator(num_graphs)))
    print('Parallel result: ', end='')
    result = 0
    with Pool(processes=None) as pool:
        for t in pool.imap(get_number_from_graph, GraphIterator(num_graphs)):
            result += t
    print(result)
Current output:
Sequential result: 5050
Parallel result: Traceback (most recent call last):
File "/home/rburing/src/gcaops/multiprocessing_issue.py", line 33, in <module>
for t in pool.imap(get_number_from_graph, GraphIterator(num_graphs)):
File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
raise value
AssertionError: iterator only works on the main thread
Desired output:
Sequential result: 5050
Parallel result: 5050
Pool.imap will indeed run the iterator in another thread.

Run the iterator in a normal for loop, and use the apply_async pool method instead of one of the map variants: that way you control the iterator in the current thread (a sketch of this variant follows the example below). That approach needs some boilerplate for retrieving the results, so it might be even better to use concurrent.futures.ProcessPoolExecutor and the .submit method instead, and then concurrent.futures.as_completed to retrieve the results without the need for a callback:
from concurrent.futures import ProcessPoolExecutor, as_completed

...

if __name__ == '__main__':
    num_graphs = 100
    print('Sequential result:', sum(get_number_from_graph(g) for g in GraphIterator(num_graphs)))
    print('Parallel result: ', end='')
    with ProcessPoolExecutor() as executor:
        # The set comprehension consumes GraphIterator in the main thread;
        # only get_number_from_graph runs in the worker processes.
        futures = {executor.submit(get_number_from_graph, item) for item in GraphIterator(num_graphs)}
        result = sum(fut.result() for fut in as_completed(futures))
    print(result)
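For completeness, here is a minimal sketch of the apply_async variant mentioned above, reusing Graph, GraphIterator and get_number_from_graph exactly as defined in the question:

from multiprocessing import Pool

...

if __name__ == '__main__':
    num_graphs = 100
    print('Parallel result: ', end='')
    with Pool(processes=None) as pool:
        # The list comprehension consumes GraphIterator in the main thread;
        # each item becomes one task for a worker process.
        async_results = [pool.apply_async(get_number_from_graph, (g,)) for g in GraphIterator(num_graphs)]
        # AsyncResult.get() blocks until that task has finished.
        result = sum(r.get() for r in async_results)
    print(result)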
If you need the results in order, or otherwise need to know which input produced which output, the ProcessPoolExecutor version needs some modifications: simply associating each future with a sequential number, and returning that number along with the result, lets you build a dictionary instead:
...

import functools

def enumerator(func):
    # Decorator that tags the wrapped function's result with an item number,
    # so each output can be matched to its input afterwards.
    # functools.wraps keeps the wrapper picklable under the original function's
    # name, which is required for sending it to the worker processes.
    @functools.wraps(func)
    def wrapper(item_number, *args, **kwargs):
        result = func(*args, **kwargs)
        return item_number, result
    return wrapper

@enumerator
def get_number_from_graph(graph):
    return graph._num_vertices

if __name__ == '__main__':
    num_graphs = 100
    ...
    with ProcessPoolExecutor() as executor:
        futures = {executor.submit(get_number_from_graph, index, item)
                   for index, item in enumerate(GraphIterator(num_graphs))}
        result = dict(fut.result() for fut in as_completed(futures))
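The resulting dictionary maps each index to its value, so the results can be read back in the original iteration order; for example, continuing after the with block:

# result maps index -> value; recover the input order:
ordered = [result[index] for index in range(num_graphs)]
print(sum(ordered))  # 5050, same as the sequential result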