I have quite a large DataFrame (50+ million rows) in which one column contains DNA sequences (one sequence per row). Some of these sequences contain a mix of lowercase and uppercase letters. I would like to keep only the sequences that are at least 50% uppercase, i.e. drop any sequence that is more than 50% lowercase.
I took a small subset of my DataFrame and it took about 2 minutes just to filter out the sequences. I am hoping to find a more efficient way so that I can scale up to the full dataset.
Example of my DF:
label sequence
1 aaaggGtTt...
0 AAAggccCCC...
Here is the function I am using.
def remove_low_complexity_seqs(sequence, threshold=0.5):
    """
    Check if more than a given threshold proportion of the sequence is lowercase (low complexity).

    Args:
    - sequence (str): The nucleotide sequence.
    - threshold (float): The proportion threshold (default is 0.5 for 50%).

    Returns:
    - bool: True if more than threshold proportion is lowercase, otherwise False.
    """
    lowercase_count = sum(map(str.islower, sequence))
    proportion = lowercase_count / 10000  # 10k is the length of all seqs
    return proportion > threshold
Code I ran:
# mask = control_seqs['sequence'].apply(lambda seq: not remove_low_complexity_seqs(seq, context)) # long runtime 115secs
# control_seqs = control_seqs[mask] # quick runtime
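
For comparison, here is a minimal sketch of one possible vectorized alternative to the per-row `apply`. It assumes the sequences are pure ASCII and all have the same length (as stated above), and the helper names `lowercase_fraction` and `keep_mostly_uppercase` are hypothetical, introduced only for illustration. The idea is to view the whole column as one NumPy byte matrix and count lowercase characters in a single pass. It has not been timed on the real data.

```python
import numpy as np
import pandas as pd


def lowercase_fraction(sequences: pd.Series) -> np.ndarray:
    """Return the fraction of lowercase ASCII letters in each sequence.

    Assumption: every sequence is ASCII and all sequences share one length.
    """
    seq_len = len(sequences.iloc[0])
    # Join everything into one bytes buffer, then view it as an
    # (n_rows, seq_len) matrix of character codes. reshape raises a
    # ValueError if the sequences do not all have the same length.
    buf = "".join(sequences.tolist()).encode("ascii")
    codes = np.frombuffer(buf, dtype=np.uint8).reshape(len(sequences), seq_len)
    # ASCII lowercase letters occupy the range 'a'..'z'.
    is_lower = (codes >= ord("a")) & (codes <= ord("z"))
    return is_lower.mean(axis=1)


def keep_mostly_uppercase(df, column="sequence", threshold=0.5):
    """Keep rows whose lowercase fraction does not exceed the threshold."""
    mask = lowercase_fraction(df[column]) <= threshold
    return df[mask]


# Tiny made-up example with 10-character sequences.
demo = pd.DataFrame({
    "label": [1, 0, 1],
    "sequence": ["aaaggGtTtc", "AAAggccCCC", "acgtaACGTA"],
})
filtered = keep_mostly_uppercase(demo)
print(filtered)
```

With 10k-character sequences the intermediate matrices take roughly 10 KB per row each, so on the full dataset this would presumably need to be applied to slices of the DataFrame (for example a few tens of thousands of rows at a time) rather than to the whole column at once.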