Background
I had this question, which I have now “solved” with a marginally functional pipeline, but I don’t like the solution.
Quick summary: that link shows two Pub/Sub streams that I want to join. One is a “parent”, and the other has “children” that can join the parent using a shared key. (I.e., it is a 1:N join.)
What I didn’t specify in that original question is that the children will sometimes trickle in very slowly: some may arrive less than a second after the parent, but some may arrive a week, a month, or even later than that.
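For context, the join itself is shaped roughly like this (simplified; `p` is the pipeline, and the topic names and `join_key` field are illustrative stand-ins, not my real schema):

```python
import json

import apache_beam as beam

# Two unbounded sources, each keyed on the shared join key.
parents = (
    p
    | "ReadParents" >> beam.io.ReadFromPubSub(topic="projects/my-proj/topics/parents")
    | "ParseParents" >> beam.Map(json.loads)
    | "KeyParents" >> beam.Map(lambda e: (e["join_key"], e))
)
children = (
    p
    | "ReadChildren" >> beam.io.ReadFromPubSub(topic="projects/my-proj/topics/children")
    | "ParseChildren" >> beam.Map(json.loads)
    | "KeyChildren" >> beam.Map(lambda e: (e["join_key"], e))
)

# 1:N join: each output is (key, {'parent': [parent], 'child': [child, ...]}).
joined = (
    {"parent": parents, "child": children}
    | "Join" >> beam.CoGroupByKey()
)
```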
Poor Solution
The compromise that I’ve made, which I am not at all confident about, is to create a `GlobalWindow` with a repeated trigger and an allowed lateness. Specifically, my windowing is as follows (in Python):
| "Window the parent" # The windowing is the same for the children
>> WindowInto(
window.GlobalWindows(),
trigger=trigger.Repeatedly(trigger.AfterProcessingTime(10)),
accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
allowed_lateness=3600,
)
The incorrect solution above:
- Ensures that any child can be matched with the parent for all of time, since it is a global window.
- Has the data updates processed every 10 seconds, if there is any new child or parent/child combination.
- I’m not worrying about duplicates; I am removing them downstream. So I choose `ACCUMULATING`, because I’ll just keep whichever `CoGroupByKey` result has the most children with it.
- And honestly, I’m not sure the `allowed_lateness` is doing anything. It seems to be irrelevant since I have a global window, but confirmation would be nice.
Downstream of this, I write the results of the `CoGroupByKey` into a BigQuery table (after cleaning up formatting, etc.).
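The write itself is unremarkable; roughly this (the table name and the `flatten_row` formatting step are illustrative placeholders):

```python
import apache_beam as beam

(
    joined
    | "FormatRows" >> beam.Map(flatten_row)  # cleanup/formatting (placeholder)
    | "WriteToBQ" >> beam.io.WriteToBigQuery(
        "my-proj:my_dataset.parent_child_joined",  # illustrative table name
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    )
)
```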
It is working, insofar as I am getting all of the data and none of it is lost. But AFAICT, because I have a `GlobalWindow`, the following is happening:
- All data is being kept.
- Every 10 seconds, the parent-child grouping is being done again and re-written to BigQuery.
- But due to the global window, the data is never purged from Dataflow’s memory/storage.
  - (Though I’m actually not sure on this point.)
OK Solution
I would like to create windowing that could do the following (a rough sketch of what I imagine follows this list):
- Keep the parent and children for 1 week.
  - The windowing needs to match, so the parent and the children will both be kept for a week; for the children the week won’t really matter, because only 1 parent will be received.
- Process any child that is received during that parent window via `CoGroupByKey`, accumulated with all the previous children, so that the entire set of children is received into BigQuery.
  - (Or, theoretically, the same thing would happen for any parent that is received during the 1 week of a child. But given how the Pub/Sub messages arrive, that won’t realistically happen.)
- Every 10 seconds, output a new `CoGroupByKey` grouping, if there is any new data.
- Purge the parent and all of the children with the associated match key after the 1-week window.
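The closest thing I can picture is session windows with a one-week gap, but I genuinely don’t know whether this gives the retention and purge behavior above, so treat it as a guess rather than a working answer:

```python
from apache_beam import WindowInto, window
from apache_beam.transforms import trigger

| "Window parent and children"
>> WindowInto(
    window.Sessions(7 * 24 * 60 * 60),  # merge elements arriving within 1 week (event time)
    trigger=trigger.Repeatedly(trigger.AfterProcessingTime(10)),
    accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
)
```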
What is the windowing scheme that I should implement to accomplish this?
Ideal Solution
Even better, if it is possible:
- I actually know when a transaction will have no more events.
  - I.e., if you think of it like a state machine, I have a state that, when reached, says “terminated”.
- Therefore, the best thing to do would be to use the global window with the 10-second trigger, same as I’m doing now, and hang on to everything indefinitely, until the element of the PCollection reaches the terminal state, then remove the element from Dataflow’s memory/storage. (A sketch of what I have in mind follows this list.)
- This is because almost all data reaches that terminal state in about 1 minute; the very late stragglers are a tiny fraction of the data.
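In my head, that would look something like a stateful `DoFn` that buffers per key and clears its state when the terminal event arrives. A sketch, not tested; the `"state"` field and `"terminated"` value are illustrative stand-ins for my actual state machine:

```python
import apache_beam as beam
from apache_beam.coders import PickleCoder
from apache_beam.transforms.userstate import BagStateSpec

class AccumulateUntilTerminal(beam.DoFn):
    """Buffer every event for a key, emit the accumulated set on each
    new event, and drop the buffer once the terminal event arrives."""

    BUFFER = BagStateSpec("buffer", PickleCoder())

    def process(self, element, buffer=beam.DoFn.StateParam(BUFFER)):
        key, event = element
        if event.get("state") == "terminated":  # illustrative terminal marker
            buffer.clear()  # purge this key from Dataflow's state
            return
        buffer.add(event)
        yield key, list(buffer.read())  # everything seen so far for this key

# State is per key and per window, so under a global window the buffer
# lives until we clear it ourselves:
# keyed_events | "Accumulate" >> beam.ParDo(AccumulateUntilTerminal())
```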
I can provide more code if it would be helpful, but this question is already long.