My question is on the best way to cluster a graph of class instances around specifically marked objects (objects are the graph nodes and the references to each other are the directed edges of the graph). To better explain my question, let me explain my motivation:
I currently use a moderately complex system to serialize the data used in my projects:
- “marked” objects have a specific attributes which stores a “saving entry”: the path to an associated file on disc (but it could be done for any storage type providing the suitable interface)
- Those object can then be serialized automatically (eg:
obj.save()
) - The serialization of a marked object
'a'
contains implicitly all objects'b'
for which'a'
has a reference to, directly s.t:a.b = b
, or indirectly s.t.:a.c.b = b
for some object'c'
This basically define specific storage entries to specific objects. I then have “container” type objects that:
- can be serialized similarly (in fact their are or can-be “marked”)
- they don’t serialize in their storage entries the “marked” objects (with direct reference): if
a
anda.b
are both marked,a.save()
callsb.save()
and storesa.b = storage_entry(b)
So, if I serialize 'a'
, it will serialize automatically all objects that can be reached from 'a'
through the object reference graph, possibly in multiples entries. That is what I want, and it usually provides the functionalities I need. However, there are some structural limitations to this approach:
- the multi-entry saving can only works through direct connections in “container” objects, and
- there are situations with undefined behavior such as if two “marked” objects
'a'
and'b'
both have a reference to an unmarked object'c'
. In this case my system will stores'c'
in both'a'
and'b'
making an implicit copy which not only double the storage size, but also change the object reference graph after re-loading.
I am thinking of generalizing the process. Apart for the practical questions on implementation (I am coding in python, and use Pickle to serialize my objects), there is a general question on the way to attach (cluster) unmarked objects to marked ones.
So, my questions are:
- What are the important issues that should be considered? Basically why not just use any graph parsing algorithm with the “attach to last marked node” behavior.
- Is there any work done on this problem, practical or theoretical, that I should be aware of?
Example of a simple object graph to serialize:
Circles are objects and arrow references. Orange circles are marked objects.
When serializing A
, object B
to E
should be serialized but not blue objects because they are not reachable from A
. Now, B
is serialized in its own entry (i.e. file) and the question is how to choose where to serialize objects C
to E
, to A
or B
entry?
Some thinking on this example:
- object
C
is only reachable fromA
, so its should be attached to it - to reach
E
, it is necessary to pass byB
, soE
should be attached toB
- what about
D
? Because after serialization ifB
is loaded (and notA
),D
should also be loaded but notA
. ThusD
should be serialized inB
entry.
Note:
I added the tag database
because I think the answer might come from that fields, even if the question is not really.
0
It looks like there’s a missing conceptual step going on here. An object reference is not the object itself. So, perhaps consider serializing object references, and don’t serialize object c inside a or b, for example. Serialize object c outside of them, and you save space and processing time.
Edit:
Basically, it would work like this:
foreach(object reference in serializeable)
serialize(object in global)
serialize(object reference)
Then, when you deserialize, you just make the object reference point back to the correct object.
9