Preamble and goals
I have a working algorithm (and working code) for a custom reachable-vertices computation with memoization on a 3-dimensional graph, but it is not performant enough. My goal is to find ways to make this procedure faster, if possible.
Description
Consider the following problem. Let a graph G contain V nodes and E directed edges between the nodes, set up randomly. Let the graph be dense enough that E approaches V^2, and let it be allowed to contain cycles. Let K be a subset of V, over which we wish to find the set of reachable vertices for each vertex in K.
Let each vertex v_i in G have S states, s in {1, ..., S}, each state having its own reachable-vertices cache. Given this, we can picture the graph as a three dimensional graph.
As is, this algorithm is implemented using an iterative DFS that interacts with a cache system. This cache system is simply a structure that holds a cache for each vertex state on the graph. The cache for a vertex contains the reachable vertices for that vertex at that state.
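A simplified sketch of what I mean by the cache system (the type names, the hash function, and the flat unordered_map layout here are illustrative, not my exact implementation):

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Identifies one (vertex, state) pair, i.e. one node of the "3D" graph.
struct VertexState {
    uint32_t vertex;
    uint32_t state;
    bool operator==(const VertexState&) const = default;
};

struct VertexStateHash {
    size_t operator()(const VertexState& vs) const {
        // assumes a 64-bit size_t
        return (static_cast<size_t>(vs.vertex) << 32) | vs.state;
    }
};

// One cache per (vertex, state): the vertices reachable from it.
struct ReachabilityCache {
    std::vector<VertexState> reachable_vertices;
    bool complete = false; // set once the DFS has fully finished this node
};

using CacheSystem = std::unordered_map<VertexState, ReachabilityCache, VertexStateHash>;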
The problem
My algorithm is critically slow for a non-negligible set of graph inputs because it spends ~80% of its time sorting. This is especially evident in graphs that have cycles and where E approaches V^2.
The way I implemented it, each time I end the visit on a vertex during the search:
- I take the reachable vertices from the cache of that vertex and sort-and-compact them, so that vertices appearing in different states and duplicated vertices can be compacted together.
- I then append the currently visited vertex to the parent's reachable-vertices cache.
- I append the reachable vertices from the current vertex's cache to the cache of its parent.
procedure EndVisit(X):
reachable_vertices = cache_get(X)->reachable_vertices_get // N elements
sortAndCompact(reachable_vertices) // M unique elements | M <= N
// parent from the current DFS stack:
parent_reachable_vertices = parent_cache_get(X)->reachable_vertices_get
parent_reachable_vertices.append(X)
parent_reachable_vertices.append(reachable_vertices)
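In C++ terms, EndVisit boils down to something like the sketch below (the CacheEntry type and plain int vertex-state ids are stand-ins for my real cache and encoding; the compaction is shown as std::sort followed by std::unique/erase):

#include <algorithm>
#include <vector>

// Stand-in for one cache entry: the reachable vertex-state ids accumulated
// so far for one (vertex, state), possibly with duplicates.
struct CacheEntry {
    std::vector<int> reachable_vertices;
};

// Sort the accumulated ids and drop duplicates in place.
void sortAndCompact(std::vector<int>& v) {
    std::sort(v.begin(), v.end());                      // the O(N log N) hot spot
    v.erase(std::unique(v.begin(), v.end()), v.end());  // keep one of each run
}

// EndVisit(X): compact X's cache, then push X and its reachable set
// into the cache of X's parent on the current DFS stack.
void endVisit(int x, CacheEntry& x_cache, CacheEntry& parent_cache) {
    sortAndCompact(x_cache.reachable_vertices);
    parent_cache.reachable_vertices.push_back(x);
    parent_cache.reachable_vertices.insert(
        parent_cache.reachable_vertices.end(),
        x_cache.reachable_vertices.begin(),
        x_cache.reachable_vertices.end());
}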
Iterative example of the algorithm and considerations
Consider the following trivial graph with cycles.
We wish to find the reachable vertices of vertex B (in state 1) and vertex A (in state 1). This is what happens when running my modified iterative DFS procedure (note that the path the DFS takes, i.e., which children it visits first, is random):
call ReachableVerticesGet(vertex: B, state: 1):
- [b1]
- [b1,c1]
Call EndVisit ==> c1 {}, b1 {c1}
- [b1]
- [b1,a1]
- [b1,a1,b2]
- [b1,a1,b2,c1]
Call EndVisit ==> c1 {}, b2 {c1}
- [b1,a1,b2]
- [b1,a1,b2,a1] Cycle detected, we mark it
Call EndVisit ==> a1 {}, b2 {c1,a1}
- [b1,a1,b2]
Call EndVisit ==> b2 {c1,a1}, a1 {b2,c1,a1}
- [b1,a1]
- [b1,a1,d1]
- [b1,a1,d1,b2]
- [b1,a1,d1,b2,c1]
Call EndVisit ==> c1 {}, b2 {c1,a1,c1}
- [b1,a1,d1,b2]
- [b1,a1,d1,b2,a1] Cycle detected, we mark it
Call EndVisit ==> a1 {b2,c1,a1}, b2 {c1,a1,c1,a1,b2,c1,a1}
- [b1,a1,d1,b2]
Call EndVisit ==> b2 {b2,c1,a1}, d1 {b2,b2,c1,a1}
- [b1,a1,d1]
Call EndVisit ==> d1 {b2,c1,a1}, a1 {b2,c1,a1,d1,b2,c1,a1}
- [b1,a1]
Call EndVisit ==> a1 {b2,c1,a1,d1}, b1 {c1,a1,b2,c1,a1,d1}
- [b1]
Call EndVisit ==> b1 {c1,a1,b2,d1}
return {c1,a1,b2,d1}
call ReachableVerticesGet(vertex: A, state: 1):
A's cache contains {b2,c1,a1,d1}
return {b2,c1,a1,d1}
Now, this algorithm works. I have many tests, including weird graphs with cycles and randomly constructed graphs, and all of them pass. There are some degenerate cases where the algorithm runs for an abnormally long time despite the graph being fairly small, but nothing critical; it runs correctly for the vast majority of cases. The problem is that this algorithm is wildly problematic in terms of performance.
- First, if I remove the SortAndCompact() call from each EndVisit(), RAM usage spikes to >32 GB very quickly with no sign of stopping (there is no memory leak).
- Second, I call std::sort every time I end the visit on a vertex, and in graphs that have cycles and vertices with multiple states, like the one in this example, the sort is sometimes called multiple times for the same vertex. Of course, I would love not to call std::sort at all, since it is O(N log N), but the more I stared at graphs for the problem I'm trying to solve, the more std::sort made sense at the time. There must be better solutions, nevertheless.
What I have tried so far
- The most obvious: use a set/unordered_set instead of a vector for the reachable-vertices cache at each vertex. However, for my use cases this does not give any reasonable improvement, even with custom hash functions. Profiling shows the time simply migrates from std::sort to std::set::insert.
- I tried parallelizing the algorithm by cloning the caches, but merging the caches afterwards was a nightmare and I dropped that line of investigation; there is too much shared state.
- I tried using std::pmr (polymorphic memory resources) to build an unordered_map on a stack buffer every time I would otherwise call std::sort; the map is used to compact the duplicates and replaces std::sort (a sketch of this variant is shown after this list). This gives ~25% improvement in many cases, which is still not satisfactory in my opinion.
- I considered running a polling system that accumulates reachable-vertices caches and invokes SortAndCompact() on them when system RAM reaches a certain threshold. This was a big hit on performance and I dropped the idea as well.
- Most of the input graphs in my application have strongly connected components, so I tried some preprocessing that sets vertex priorities for the DFS accordingly, but without quantifiable improvement.
- I considered rewriting everything from scratch with a new algorithm, but that would take a lot of time, so I opted to look for ways to improve the performance instead.
- I have tried a wide range of different things, and at this point I have been on this for many months.
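For reference, the pmr variant from the list above looks roughly like this (a sketch only: the 64 KB buffer size is arbitrary, I use std::pmr::unordered_set rather than the unordered_map for brevity, and the output ends up deduplicated but unsorted):

#include <array>
#include <cstddef>
#include <memory_resource>
#include <unordered_set>
#include <vector>

// Duplicate removal without std::sort: build a hash set backed by a stack
// buffer (monotonic_buffer_resource), then rewrite the vector with the
// unique ids. Overflowing the buffer falls back to the default resource.
void compactWithPmr(std::vector<int>& reachable_vertices) {
    std::array<std::byte, 64 * 1024> buffer;  // stack scratch space
    std::pmr::monotonic_buffer_resource arena(buffer.data(), buffer.size());

    std::pmr::unordered_set<int> seen(&arena);
    seen.reserve(reachable_vertices.size());

    std::vector<int> unique_ids;
    unique_ids.reserve(reachable_vertices.size());
    for (int id : reachable_vertices) {
        if (seen.insert(id).second)  // first time we see this id
            unique_ids.push_back(id);
    }
    reachable_vertices.swap(unique_ids);
}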
Overall I have tried many tricks involving:
- reducing the calls to std::sort
- eliminating the calls to std::sort
- improving the speed of the calls to std::sort
but I am at a loss for ideas. I fear that I may have to replace this reachability algorithm altogether, but I'm looking for a second opinion before I drop it entirely.