I have a fairly large dataset (about 100 GB) of blockchain data. I want to merge two tables on transactionHash, which would normally be out of reach (O(n^2)), except that both tables are already ordered by blockNumber, so a sorted merge can do it in O(|A|+|B|) = O(n).
The tables look something like this, among other columns:
 | blockNumber | transactionHash |
---|---|---|
0 | 1 | 0x0 |
1 | 1 | 0x1 |
2 | 1 | 0x6 |
3 | 2 | 0x2 |
4 | 2 | 0x8 |
5 | 2 | 0xf |
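To make the linear merge concrete, this is roughly the two-pointer scan I have in mind (a pure-Python sketch over hypothetical (blockNumber, transactionHash) tuples, not my real schema):

```python
def merge_sorted(rows_a, rows_b):
    # rows_a, rows_b: lists of (blockNumber, transactionHash, ...) tuples,
    # both sorted by blockNumber.
    out = []
    i = j = 0
    while i < len(rows_a) and j < len(rows_b):
        if rows_a[i][0] < rows_b[j][0]:
            i += 1
        elif rows_a[i][0] > rows_b[j][0]:
            j += 1
        else:
            block = rows_a[i][0]
            # Find where this block ends on both sides.
            i2, j2 = i, j
            while i2 < len(rows_a) and rows_a[i2][0] == block:
                i2 += 1
            while j2 < len(rows_b) and rows_b[j2][0] == block:
                j2 += 1
            # Within a single block, match rows on transactionHash.
            by_hash = {row[1]: row for row in rows_b[j:j2]}
            for row in rows_a[i:i2]:
                if row[1] in by_hash:
                    out.append(row + by_hash[row[1]])
            i, j = i2, j2
    return out

rows_a = [(1, "0x0"), (1, "0x1"), (2, "0x2")]
rows_b = [(1, "0x1"), (2, "0x2"), (2, "0x8")]
print(merge_sorted(rows_a, rows_b))
# [(1, '0x1', 1, '0x1'), (2, '0x2', 2, '0x2')]
```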
In pandas, I would use pd.merge_ordered, but this doesn’t seem to be available in dask.
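In pandas it would be something like this (tiny toy frames standing in for the real tables):

```python
import pandas as pd

# Toy stand-ins for the two tables, both already sorted by blockNumber.
df_a = pd.DataFrame({"blockNumber": [1, 1, 2],
                     "transactionHash": ["0x0", "0x1", "0x2"],
                     "a": [10, 11, 12]})
df_b = pd.DataFrame({"blockNumber": [1, 2, 2],
                     "transactionHash": ["0x1", "0x2", "0x8"],
                     "b": [20, 21, 22]})

merged = pd.merge_ordered(df_a, df_b, on="transactionHash", how="inner")
```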
How can I either:
- a) Implement this algorithm myself? What should I take into account, and how can I “align” the partitions? (A rough idea of what I mean is sketched after this list.)
- b) Use an alternative with other dask functions?
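For option a), this is the rough shape I imagine, assuming both tables live in parquet files (names made up) and that the i-th partition of one covers exactly the same blockNumber range as the i-th partition of the other:

```python
import dask.dataframe as dd
from dask import delayed

# Hypothetical inputs; both assumed sorted by blockNumber.
df_a = dd.read_parquet("table_a.parquet")
df_b = dd.read_parquet("table_b.parquet")

@delayed
def merge_partition(part_a, part_b):
    # Plain pandas merge inside each aligned partition pair; cheap
    # because both parts cover the same blockNumber range.
    return part_a.merge(part_b, on="transactionHash", how="inner")

# Only valid if the partitions of the two frames line up one-to-one.
parts = [merge_partition(a, b)
         for a, b in zip(df_a.to_delayed(), df_b.to_delayed())]
merged = dd.from_delayed(parts)
```

What I don’t know is how to guarantee that alignment when the two files have different partition boundaries.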
I can’t use .set_index() because my key spans two columns (not one), and even setting the index to just the timestamp column freezes the IPython kernel.