I’m trying to remap primary keys between versioned data and code releases with as small a diff as possible.
I have a list of elements, e.g. [a,b,c,...,j], and some deterministic process that assigns these elements to groups, e.g. g0_0={a,b,c}, g0_1={d,e,f,g,h}, g0_2={i}, and g0_3={j}.
Later, I add elements to the list, e.g. [k,l], and I improve my process; the new groups are g1_0={a,b,c,d}, g1_1={e,f,g,h,i,k}, and g1_2={j,l}.
How can I efficiently obtain a (many-to-many) mapping from old groups to new groups that minimises the differences between the mapped sets?
In this example we would see:
g0_0 |-> g1_0: diff=1
g0_1,g0_2 |-> g1_1: diff=2
g0_3 |-> g1_2: diff=1
total_diff = 4
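For reference, this is how I compute the total diff for a candidate mapping (a small Python sketch; I'm treating the diff of a new group as the symmetric difference between it and the union of the old groups mapped onto it, which reproduces the numbers above):

```python
old_groups = {"g0_0": {"a", "b", "c"},
              "g0_1": {"d", "e", "f", "g", "h"},
              "g0_2": {"i"},
              "g0_3": {"j"}}
new_groups = {"g1_0": {"a", "b", "c", "d"},
              "g1_1": {"e", "f", "g", "h", "i", "k"},
              "g1_2": {"j", "l"}}

# Candidate many-to-many mapping: new group -> old groups mapped onto it.
mapping = {"g1_0": ["g0_0"], "g1_1": ["g0_1", "g0_2"], "g1_2": ["g0_3"]}

total_diff = 0
for new_name, old_names in mapping.items():
    mapped = set().union(*(old_groups[o] for o in old_names))
    diff = len(mapped ^ new_groups[new_name])  # symmetric difference
    print(f"{','.join(old_names)} |-> {new_name}: diff={diff}")
    total_diff += diff
print(f"total_diff = {total_diff}")  # 1 + 2 + 1 = 4
```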
I could assemble this as a mixed integer optimisation problem with a bipartite graph:
g0_0  g0_1  g0_2  g0_3
  |     |   /     /
g1_0  g1_1  g1_2
where each edge is a binary decision variable in {0,1}, and I minimise the total difference.
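Concretely, I could write the MIP roughly like this (a sketch using PuLP on the toy data; the variable names and the way I split the cost into "uncovered new elements" plus "extra elements dragged in by an edge" are my own assumptions about how to encode the diff):

```python
import pulp

old_groups = {"g0_0": {"a", "b", "c"}, "g0_1": {"d", "e", "f", "g", "h"},
              "g0_2": {"i"}, "g0_3": {"j"}}
new_groups = {"g1_0": {"a", "b", "c", "d"},
              "g1_1": {"e", "f", "g", "h", "i", "k"}, "g1_2": {"j", "l"}}

prob = pulp.LpProblem("group_remap", pulp.LpMinimize)

# x[i, j] = 1 if old group i maps to new group j (one binary edge per pair).
x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat=pulp.LpBinary)
     for i in old_groups for j in new_groups}
# y[e, j] = 1 if element e of new group j is covered by some mapped old group.
y = {(e, j): pulp.LpVariable(f"y_{e}_{j}", cat=pulp.LpBinary)
     for j, members in new_groups.items() for e in members}

# e can only count as covered in j if an old group containing e maps to j.
for j, members in new_groups.items():
    for e in members:
        owners = [i for i, g in old_groups.items() if e in g]
        prob += y[(e, j)] <= pulp.lpSum(x[(i, j)] for i in owners)

# Total diff = uncovered elements of each new group
#            + elements each selected edge brings in that are not in the new group.
prob += (pulp.lpSum(1 - y[key] for key in y)
         + pulp.lpSum(len(old_groups[i] - new_groups[j]) * x[(i, j)]
                      for (i, j) in x))

prob.solve(pulp.PULP_CBC_CMD(msg=False))
edges = sorted((i, j) for (i, j), var in x.items() if var.value() > 0.5)
print(edges, "total_diff =", pulp.value(prob.objective))
```

On the toy example this recovers the mapping above with total_diff = 4, but it creates one binary x variable per (old group, new group) pair.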
However, in this application, my number of groups is much larger, and I expect an adjacency matrix of size count(g0_*) × count(g1_*) will be too large to make this approach tractable.
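For a sense of scale, here is how I would measure that on the toy data (again just a sketch): the dense formulation needs count(g0_*) × count(g1_*) binary edges, whereas only the pairs of groups that actually share at least one element look like plausible candidates.

```python
# Compare the dense edge count with the number of (old, new) pairs that overlap.
old_groups = {"g0_0": {"a", "b", "c"}, "g0_1": {"d", "e", "f", "g", "h"},
              "g0_2": {"i"}, "g0_3": {"j"}}
new_groups = {"g1_0": {"a", "b", "c", "d"},
              "g1_1": {"e", "f", "g", "h", "i", "k"}, "g1_2": {"j", "l"}}

dense = len(old_groups) * len(new_groups)
overlapping = sum(1 for g0 in old_groups.values()
                  for g1 in new_groups.values() if g0 & g1)
print(f"dense edges: {dense}, edges with any overlap: {overlapping}")  # 12 vs 5
```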