Reposted from here, as I think it may be better suited to this exchange.
I’m trying to implement DRUM (Disk Repository with Update Management) as described in the IRLBot paper (the relevant pages start at page 4). As a quick summary, it’s essentially an efficient way of batch-updating (key, value) pairs against a persistent repository. In the linked paper it’s used as the backbone behind the crawler’s URLSeen test, RobotsTxt check, and DNS cache.
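For context, this is roughly the interface I understand DRUM to expose (the operation names follow the paper, but the signatures are my own guess, not taken from either implementation):

```cpp
#include <cstdint>
#include <string>

// Rough sketch of the DRUM interface as I understand it; the exact
// signatures are my own. Each operation takes a key (e.g. a 64-bit
// hash of a URL), an optional value, and an auxiliary payload (e.g.
// the original URL text) that DRUM passes through untouched and
// hands back alongside the operation result.
class Drum {
public:
    using Key = std::uint64_t;

    virtual ~Drum() = default;

    // Asynchronously test whether key exists in the repository.
    virtual void check(Key key, const std::string& aux) = 0;

    // Insert or overwrite the value stored for key.
    virtual void update(Key key, const std::string& value,
                        const std::string& aux) = 0;

    // Combined operation: check, and update in the same pass.
    virtual void check_update(Key key, const std::string& value,
                              const std::string& aux) = 0;
};
```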
There has helpfully been a C++ implementation done here, which lays out the architecture in a much more digestible way. For ease of reference, this is the architecture diagram from the C++ implementation:
The part I’m struggling to understand is the reasoning behind keeping the (key, value) buckets and the auxiliary buckets separate. The article with the C++ implementation states the following:
During merge a key/value bucket is read into a separate buffer and sorted. Its content is synchronized with that of the persistent repository. Checks and updates happen at this moment. Afterwards, the buffer is re-sorted to its original order so that key/value pairs match again the corresponding auxiliary bucket. A dispatching mechanism then forwards the key, value and auxiliary for further processing along with the operation result. This process repeats for all buckets sequentially.
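If I follow that correctly, the merge step looks something like this (a simplified, in-memory sketch of my reading; the real implementation streams the buckets from disk, and the repository/dispatch stubs here are placeholders of my own):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One entry of a key/value bucket. The position field records the
// entry's original slot so the pre-sort order can be restored later.
struct KeyValue {
    std::uint64_t key;
    std::string   value;
    std::size_t   position;
    bool          found{};  // filled in by the repository check
};

// Stub standing in for the sequential pass over the sorted persistent
// repository (checks and updates would happen here).
bool repository_contains(std::uint64_t /*key*/) { return false; }

// Stub standing in for the dispatching mechanism that forwards key,
// value, aux and the operation result for further processing.
void dispatch(const KeyValue& /*kv*/, const std::string& /*aux*/) {}

void merge_bucket(std::vector<KeyValue>& kv_bucket,
                  const std::vector<std::string>& aux_bucket) {
    // 1. Sort by key so synchronizing with the (sorted) repository
    //    is a single sequential disk pass.
    std::sort(kv_bucket.begin(), kv_bucket.end(),
              [](const KeyValue& a, const KeyValue& b) {
                  return a.key < b.key;
              });

    // 2. Synchronize: checks and updates happen at this moment.
    for (KeyValue& kv : kv_bucket) {
        kv.found = repository_contains(kv.key);
    }

    // 3. Re-sort to the original order so that entry i once again
    //    lines up with aux_bucket[i] -- the aux entries are matched
    //    back to key/value pairs purely by position.
    std::sort(kv_bucket.begin(), kv_bucket.end(),
              [](const KeyValue& a, const KeyValue& b) {
                  return a.position < b.position;
              });

    // 4. Forward key, value, aux and the result together.
    for (std::size_t i = 0; i < kv_bucket.size(); ++i) {
        dispatch(kv_bucket[i], aux_bucket[i]);
    }
}
```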
So, if the order of the (key, value) bucket needs to be restored to that of the auxiliary bucket in order to re-link the (key, value) pairs with their auxiliary information, why not just keep the (key, value, aux) tuples together in a single bucket? What is the reasoning behind keeping them separate, and wouldn’t it be more efficient to keep them together (since you would no longer need to restore the original unsorted order of the bucket)?
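To make the alternative concrete, I’m imagining a combined bucket entry along these lines (again my own sketch, not from either implementation), where the aux data travels with the pair through the sort, so step 3 in the sketch above disappears entirely:

```cpp
#include <cstdint>
#include <string>

// Hypothetical combined entry: sort by key, synchronize with the
// repository, and dispatch directly -- the aux payload is already
// attached to each pair, so no re-sort back to the original order
// (and no positional matching against a second bucket) is needed.
struct Entry {
    std::uint64_t key;
    std::string   value;
    std::string   aux;
};
```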