I am trying to speed up a function that calculates the Shannon entropy for every position of a DNA alignment by applying multiprocessing to it. The parallel version just keeps running without returning anything, and I cannot figure out what is wrong with my code.
Specifically, I’d like to incorporate multiprocessing.Pool into my larger entropy class object, which can be instantiated from any alignment file. I’m not sure whether that’s possible, since I understand it’s not recommended to put if __name__ == '__main__': inside the class method __init__. I have been running this code in Jupyter, but I also tried it outside a Jupyter environment, with Python 3.8.5.
When I run the code below, the serial computation returns the entropy array as expected. However, when the parallel computation runs, I get no return value and no error; it just runs indefinitely.
I am pretty new to multiprocessing so any help would be greatly appreciated. Please let me know if it’s impossible to directly incorporate this into a class object without using a subprocess call or something to call a separate script.
from Bio import AlignIO
from multiprocessing import Pool
from scipy.stats import entropy
import pandas as pd
from time import perf_counter
path = 'path_to_aln_fasta'
alignment = AlignIO.read(path, 'fasta')
def calc_entropy(column):
    series = pd.Series(column)
    vals = series.value_counts(normalize=False)
    return entropy(vals)
def pool_calc_entropy(alignment):
    pool = Pool(8)
    entropies = pool.map(calc_entropy, [alignment[:, i] for i in range(alignment.get_alignment_length())])
    return entropies
start = perf_counter()
entropies_series = []
for i in range(alignment.get_alignment_length()):
    column = alignment[:, i]
    entropies_series.append(calc_entropy(column))
end = perf_counter()
series_time = end-start
print(f'series time: {series_time}')
start = perf_counter()
entropies_pool = pool_calc_entropy(alignment)
end = perf_counter()
parallel_time = end-start
print(f'parallel time: {parallel_time}')
print(f'pool time is {round(series_time / parallel_time, 2)} times faster than series time')
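To make the per-column quantity explicit: scipy.stats.entropy normalizes the counts it is given and uses the natural logarithm by default, so calc_entropy above should be equivalent to the following dependency-free sketch. The name column_entropy and the short records list (the first four bases of each example record) are only illustrative, and the sketch assumes each column is a plain string or tuple of characters.

```python
from collections import Counter
from math import log


def column_entropy(column):
    """Shannon entropy (natural log) of the symbols in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    # H = -sum(p * ln(p)) over the observed symbol frequencies
    return -sum((n / total) * log(n / total) for n in counts.values())


# Illustrative data: first four bases of each record in the input example.
records = ["GGTT", "ATGA", "AGTA", "TTGA"]

# zip(*records) yields one tuple of characters per alignment column.
entropies = [column_entropy(col) for col in zip(*records)]
```

Here a column in which every sequence has the same base gives an entropy of 0, and a column with four different bases gives ln(4).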
Input example for 'path_to_aln_fasta':
>record_one
GGTTTGATGTATATGCTCATTACTTATTTGGATTTTTGATAGAAATAAAACAGCTTTAAGCGCTTCATTACAAGTCACGCAGTTACACCTGTATCGCAGTGTTTACCATCTTGAGGGTAATTTTGCATTAGTGGACAGCTAGACTCAGCCACGCCTATACCCTTCGGAACATCGAAACATACCAACCATGATATCGTCGA
>record_two
ATGACTACTCGGTTACTTCTAATTGTTGCTAGCTGTACTTTGTGAAGTGAACAATAAGGTTTTAAAACCGTTAGACACAGTGCCTCTTTTAGGAAGATTCTTGATAATTTGCCCTGGTTTATGCATCGTTTGCACAGATTGACTTAAAGACTACGTCATAATAACTGCGCCTTGACTGCGGATTGAGCAATGCCTGCACA
>record_three
AGTAAATTTTATAGAGGTATTGCGCACCCTCGGAACTTTACAAAGCACAACCTGGATCACACTATTTCCGAGACCCACCTGACTGTGTTGTCGAGTTATCGATCCTATATATTTAAAGTTGGTTTTCAAAACATTTTTCGTTTGCGTTGTATACGGAGCCGTAGATAAGCGCATTTCGGTCATCAATCAGGGACAAATTA
>record_four
TTGAAATTAACATGCAACCTGACATGGTTTATTCTGCGGCATACAGAGAAATTGTGTTGATTGAACTGATTAGGATACTCCGGTCCCGTGTTTGAATTGTGCTAAAACCTGGTTAGAGATCGCGTCAATTCTCCATCCAGAAGAAGAACAATTTTAGGACCCGTTGTCATTTCCCTTAAAGGTTTTTGTAACAGTCCGAT
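For reference, here is a minimal sketch of how the pool could sit inside a class method rather than in __init__. It rests on several assumptions: the worker function is defined at module top level so worker processes can import it, columns are passed to the workers as plain strings rather than Biopython objects, and the if __name__ == '__main__': guard is placed only at the script entry point. The names AlignmentEntropy and _column_entropy are hypothetical, and the sequences are the first four bases of each example record.

```python
from collections import Counter
from math import log
from multiprocessing import Pool


def _column_entropy(column):
    # Defined at module top level so it can be pickled and sent to workers.
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * log(n / total) for n in counts.values())


class AlignmentEntropy:  # hypothetical class name
    def __init__(self, sequences, processes=2):
        # Store plain strings only; no pool is created in __init__.
        self.sequences = [str(s) for s in sequences]
        self.processes = processes

    def columns(self):
        # One string per alignment position.
        return ["".join(col) for col in zip(*self.sequences)]

    def serial(self):
        return [_column_entropy(c) for c in self.columns()]

    def parallel(self):
        # The context manager shuts the pool down once map() has returned.
        with Pool(self.processes) as pool:
            return pool.map(_column_entropy, self.columns())


if __name__ == "__main__":
    # The guard belongs at the entry point, not inside the class.
    aln = AlignmentEntropy(["GGTT", "ATGA", "AGTA", "TTGA"])
    print(aln.parallel())
```

With a real alignment, the sequences could presumably be obtained as str(record.seq) for each record, though I have not verified that this layout avoids the hang inside Jupyter, where functions defined in a notebook cell may not be importable by worker processes.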