Filter strings with high proportion of lowercase letters
I have a fairly large DataFrame (50+ million rows) in which one column contains DNA sequences (one sequence per row). Some of these sequences contain a mix of lowercase and uppercase letters. I would like to keep only the sequences that are at least 50% uppercase letters, i.e. drop the sequences that are mostly lowercase.
I took a small subset of my DataFrame, and it took 2 minutes just to filter out the sequences. I am hoping to find a more efficient way to do this so that I can scale up to the full dataset.
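To make the goal concrete, here is a minimal sketch of the filter written with pandas' vectorized string methods instead of a per-row Python function. The column name `sequence`, the function name `keep_mostly_uppercase`, and the `min_upper` parameter are assumptions made for illustration; they are not from my actual code. It also assumes that any character other than `A`-`Z` (lowercase letters, gaps, digits) counts against the uppercase fraction.

```python
import pandas as pd


def keep_mostly_uppercase(df, col="sequence", min_upper=0.5):
    """Return the rows of df whose string in `col` is at least
    `min_upper` (a fraction) uppercase A-Z characters.

    `col` and `min_upper` are hypothetical names for this sketch.
    """
    seqs = df[col]
    # Number of uppercase ASCII letters in each sequence.
    n_upper = seqs.str.count(r"[A-Z]")
    # Total length of each sequence.
    n_total = seqs.str.len()
    # Compare counts directly to avoid dividing by zero on empty strings.
    # Rows with missing values produce NaN counts and are dropped.
    mask = n_upper >= min_upper * n_total
    return df[mask]


# Small illustrative DataFrame (made-up sequences).
example = pd.DataFrame({"sequence": ["ACGT", "acgt", "ACgt", "Acgt", "ACGt"]})
filtered = keep_mostly_uppercase(example)
```

With this sketch, a sequence that is exactly half uppercase (such as `ACgt`) is kept. Whether this is fast enough for 50+ million rows is something I have not measured.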