How do I deduplicate large data sets with the Python Record Linkage Toolkit?
I use blocking to reduce the number of candidate record pairs in the index, but sometimes I need to build a full index (or a sorted-neighbourhood index on a couple of columns) over a large data set of approximately 1M records, which results in a couple of billion record pairs.
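To make the scale concrete, here is a minimal sketch in plain Python of how the pair count of a full deduplication index grows, and of how candidate pairs could be generated lazily in fixed-size chunks rather than materialized all at once. It deliberately does not use the toolkit's API; `full_pair_count` and `iter_pair_chunks` are hypothetical helper names introduced only for illustration.

```python
from itertools import combinations, islice


def full_pair_count(n_records):
    """Number of unordered record pairs in a full deduplication index."""
    return n_records * (n_records - 1) // 2


def iter_pair_chunks(record_ids, chunk_size):
    """Yield lists of at most chunk_size candidate pairs.

    Pairs are produced lazily, so only one chunk is held in memory
    at a time (hypothetical helper, not part of the toolkit).
    """
    pairs = combinations(record_ids, 2)
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk


# Tiny demonstration with 5 records and chunks of 4 pairs.
chunk_sizes = [len(chunk) for chunk in iter_pair_chunks(range(5), 4)]
```

Each chunk could then be compared and classified on its own before the next one is generated, which is the behaviour I am hoping to get from the toolkit.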