I use blocking to trim down the size of the candidate record-pair index, but sometimes I need a full index (or a sorted neighbourhood index on a couple of columns) on a large data set of approximately 1M records, which results in a couple of billion record pairs.
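For reference, this is roughly how the indexing step looks. It is only a minimal sketch: `df` stands in for my ~1M-row pandas DataFrame and "surname" is a made-up sorting column, not my real data.

```python
import recordlinkage

indexer = recordlinkage.Index()
indexer.full()                                      # every record against every other
# indexer.sortedneighbourhood("surname", window=9)  # the alternative I sometimes use

# For deduplication, index() is called with the single DataFrame.
# With ~1M rows, a full index has n*(n-1)/2 candidate pairs, and
# building that MultiIndex in one go is where the allocation fails.
pairs = indexer.index(df)
```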
The error is “Unable to allocate 1.16 TiB for an array with shape (159857899410,) and data type int64”
The workstation I am currently using (an m5.4xlarge with 16 cores and 64 GB of RAM) runs out of memory because it cannot hold the full multi-index of a couple of billion pairs. I know the documentation has ideas for doing record linkage between two large data sets by splitting them with numpy, but it doesn't provide anything for deduplication within a single dataframe.
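In case it helps clarify what I'm after, here is how I imagine adapting that documented two-dataset split idea to deduplication within one dataframe. This is only a rough sketch, not working code: the helper name, the column names ("surname", "date_of_birth"), and the comparison rules are all hypothetical placeholders.

```python
import numpy as np
import recordlinkage

def dedupe_in_chunks(df, n_chunks=200):
    """Deduplicate a single DataFrame without materialising the full pair
    MultiIndex: split the frame and process one chunk pair at a time."""
    chunks = np.array_split(df, n_chunks)

    indexer = recordlinkage.Index()
    indexer.full()  # same full index, but built per chunk pair, not all at once

    compare = recordlinkage.Compare()
    # Hypothetical comparison columns, for illustration only.
    compare.string("surname", "surname", method="jarowinkler", threshold=0.85)
    compare.exact("date_of_birth", "date_of_birth")

    match_indexes = []
    for i in range(len(chunks)):
        for j in range(i, len(chunks)):
            if i == j:
                # Within-chunk deduplication.
                pairs = indexer.index(chunks[i])
                features = compare.compute(pairs, chunks[i])
            else:
                # Cross-chunk comparison, treated as linking two frames,
                # so every pair from the original data is still covered once.
                pairs = indexer.index(chunks[i], chunks[j])
                features = compare.compute(pairs, chunks[i], chunks[j])
            # Keep only likely matches so memory stays bounded per chunk pair.
            match_indexes.append(features[features.sum(axis=1) >= 2].index)
    return match_indexes
```

The chunk count would trade memory for runtime (more chunks means smaller per-pair allocations but more iterations), but I'm not sure this is the intended way to use the library on a single large dataframe.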
I would greatly appreciate any suggestions or assistance in resolving this memory allocation issue at runtime.