We have a business use case that basically boils down to the following CS problem:
Given a directed graph F
and a subset of vertices / subgraph G
= v1
, v2
, … , vk
we want to find any tree T
that spans G
(i.e. all of v1
, v2
, … , vk
are all vertices of T
) and its root.
To give you an example, given following graph F
,
G
= {
plan_description
,
person_id (+ temporal condition on effective_date | submission_date}
,
benefit_code
},
the algorithm should output the tree that looks like
and return person_id (+ temporal condition on effective_date | submission_date}
as the root.
For context, we are building a visual query language over our Data Lake that allows users to ask for a sink column (e.g. plan_id
) subject to conditions on other columns (e.g. benefit_code
= MED
). We want to identify the root of this spanning tree so that we can determine the join order for the SQL that we need to generate. In this example, that join order would be:
- Do a join from the orange to the red table.
- Do a join from ^ to the green table.
or
- Do a join from the orange to the green table
- Do a join from ^ to the red table
Does this problem abstraction fit into a well-studied class of graph algorithms?