disparate notes

Simple threading in Python 3

I've been working in Python 3 on an embarrassingly parallel task: parsing data and importing it into Postgres. Once I had a single-threaded implementation done, I looked into parallelizing the program, and after a few tries I managed to get everything running in parallel.

This is a quick note on how to get a thread pool working in Python 3.6:

from concurrent.futures import ThreadPoolExecutor

def data_source():
    """Fetch data to be processed in parallel"""
    # Implement fetching data; range(100) is only a placeholder source
    for item in range(100):
        yield item

def process_data(input_value):
    """Process data without modifying input"""
    # Implement saving data to postgres


MAX_THREADS = 8

# The with block waits for all submitted tasks to finish before it exits
with ThreadPoolExecutor(MAX_THREADS) as pool:
    for value in data_source():
        pool.submit(process_data, value)

In the code block above, data_source is a generator function that yields one value at a time. pool.submit schedules process_data(value) on the pool and returns immediately with a Future; it does not wait for a free thread. At most MAX_THREADS calls run at the same time, and the remaining ones wait in the pool's internal queue until a thread becomes available. Note that the for loop itself runs through the data_source generator without pausing, so every item ends up queued even while earlier ones are still being processed.
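One thing worth knowing is that an exception raised inside process_data stays hidden in the Future returned by pool.submit until its result is requested. Below is a minimal sketch of collecting the futures and checking each one. The data_source and process_data bodies here are stand-ins (squaring numbers and failing on one input), not the real Postgres import.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_THREADS = 8

def data_source():
    """Stand-in generator; the real one would fetch data to import."""
    for item in range(20):
        yield item

def process_data(input_value):
    """Stand-in for the Postgres insert; fails on one input on purpose."""
    if input_value == 13:
        raise ValueError("bad input: 13")
    return input_value * input_value

def run():
    results, errors = [], []
    with ThreadPoolExecutor(MAX_THREADS) as pool:
        # submit() returns a Future right away; map each one back to its input
        futures = {pool.submit(process_data, value): value for value in data_source()}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except ValueError as exc:
                # The exception raised in the worker resurfaces here
                errors.append((futures[future], str(exc)))
    return sorted(results), errors

results, errors = run()
```

as_completed hands back futures in the order they finish, which is why the results are sorted before being returned.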

In this example pool.submit passes a single parameter, value, to process_data, but pool.submit can pass any number of parameters that the callee function requires, for example:

pool.submit(callee_func, callee_param_1, callee_param_2, ..., callee_param_n)
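As a small illustration of passing several arguments, here is a sketch with a hypothetical save_row function (the name and its parameters are made up for this example). Keyword arguments are forwarded to the callee as well.

```python
from concurrent.futures import ThreadPoolExecutor

def save_row(table, row, dry_run=False):
    """Hypothetical callee: pretend to save a row and describe what happened."""
    action = "would insert" if dry_run else "inserted"
    return f"{action} {row} into {table}"

with ThreadPoolExecutor(max_workers=2) as pool:
    # Everything after the callable is forwarded to it unchanged
    future = pool.submit(save_row, "events", (1, "login"), dry_run=True)
    message = future.result()
```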

I really like this threading implementation because it's really simple: there's no need to write any code to manage the thread pool, and single- and multi-threaded implementations can live side by side if there's ever a need to debug the logic in process_data.
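One way to keep both variants next to each other is a flag that switches between a plain loop and the pool. This is only a sketch: the parallel parameter and the run function are my own naming, and process_data is a trivial stand-in.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 8

def data_source():
    """Stand-in generator for the real data."""
    for item in range(10):
        yield item

def process_data(input_value):
    """Stand-in for the real processing step."""
    return input_value + 1

def run(parallel=True):
    """Run process_data over data_source, in a pool or in a plain loop."""
    if not parallel:
        # Single-threaded path: easy to step through in a debugger
        return [process_data(value) for value in data_source()]
    with ThreadPoolExecutor(MAX_THREADS) as pool:
        futures = [pool.submit(process_data, value) for value in data_source()]
        # Read results in submission order so both paths return the same list
        return [future.result() for future in futures]
```

Calling run(parallel=False) while debugging and run() otherwise keeps process_data itself untouched.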

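Still, this is not true parallelism. Python threads are real operating-system threads, but in CPython they are subject to the Global Interpreter Lock (GIL), which lets only one thread execute Python bytecode at a time, so CPU-bound tasks will not benefit from this. I/O-bound work, such as waiting on Postgres, does benefit because the GIL is released while a thread blocks on I/O.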
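Threads still pay off when the work is mostly waiting, because a blocked thread releases the GIL. The rough sketch below uses time.sleep as a stand-in for a blocking database call (an assumption for illustration, not the real workload) and times the same eight tasks with one worker and with eight.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(seconds):
    """Stand-in for a blocking database call; sleeping releases the GIL."""
    time.sleep(seconds)
    return seconds

def timed(workers, tasks=8, seconds=0.1):
    """Return the wall-clock time needed to run all tasks on the pool."""
    start = time.perf_counter()
    with ThreadPoolExecutor(workers) as pool:
        futures = [pool.submit(fake_io, seconds) for _ in range(tasks)]
        for future in futures:
            future.result()
    return time.perf_counter() - start

one_thread = timed(workers=1)      # tasks run one after another
eight_threads = timed(workers=8)   # tasks wait concurrently
```

With one worker the sleeps add up, while with eight they overlap, so eight_threads should come out well below one_thread. A CPU-bound function in place of fake_io would not show that gap.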

ThreadPoolExecutor documentation -- https://docs.python.org/3/library/concurrent.futures.html
