I’m trying to select specific rows from a Pandas
dataframe using a string to match values. As I’m running this repeatedly then I’m looking for the fastest way to do this. I’ve found that I can hash the string value and subset on this value to speed things up. Is there a better approach with improved performance?
import pandas as pd
import random
from string import ascii_letters
from timeit import default_timer as timer
# Create dataframe
df = pd.DataFrame(data=list(range(100000000)), columns=["id"])
df['searchtext'] = [''.join(random.choice(ascii_letters) for x in range(3)) for _ in range(len(df))]
df['hash'] = df.searchtext.transform(hash)
print(df)
# Standard text search
start = timer()
df[df.searchtext == 'aaa']
end = timer()
print(f'text search took {end-start} seconds')
# Search using hash
start = timer()
df[df.hash == hash('aaa')]
end = timer()
print(f'text search took {end-start} seconds')
The hash search seems to be about 50-60 times faster.