Background
I had this question, which I have now “solved” with a marginally functional pipeline, but I don’t like the solution.
Quick summary: that link shows two Pub/Sub streams that I want to join. One is a “parent”, and the other has “children” that can join the parent using a shared key. (I.e., it is a 1:N join.)
What I didn’t specify in that original question is that the children will sometimes trickle in very slowly: some may arrive less than a second after the parent, but some may arrive a week, a month, or even later than that.
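For context, the join itself is shaped roughly like this (simplified; `p` is the pipeline, and the topic names and `join_key` field are illustrative stand-ins, not my real schema):

```python
import json

import apache_beam as beam

# Two unbounded sources, each keyed on the shared join key.
parents = (
    p
    | "ReadParents" >> beam.io.ReadFromPubSub(topic="projects/my-proj/topics/parents")
    | "ParseParents" >> beam.Map(json.loads)
    | "KeyParents" >> beam.Map(lambda e: (e["join_key"], e))
)
children = (
    p
    | "ReadChildren" >> beam.io.ReadFromPubSub(topic="projects/my-proj/topics/children")
    | "ParseChildren" >> beam.Map(json.loads)
    | "KeyChildren" >> beam.Map(lambda e: (e["join_key"], e))
)

# 1:N join: each output is (key, {'parent': [parent], 'child': [child, ...]}).
joined = (
    {"parent": parents, "child": children}
    | "Join" >> beam.CoGroupByKey()
)
```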
Poor Solution
The compromise that I’ve made, which I am not at all confident about, is to create a `GlobalWindow` with a repeated trigger and an allowed lateness. Specifically, my windowing is as follows (in Python):
| "Window the parent" # The windowing is the same for the children
>> WindowInto(
window.GlobalWindows(),
trigger=trigger.Repeatedly(trigger.AfterProcessingTime(10)),
accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
allowed_lateness=3600,
)
The incorrect solution above:
- Ensures that any child can be matched with the parent for all of time, since it is a global window.
- Has the data updates processed every 10 seconds, if there is any new child or parent/child combination.
- I’m not worrying about duplicates; I am removing them downstream. So I choose `ACCUMULATING`, because I’ll just keep whichever `CoGroupByKey` result has the most children with it.
- And honestly, I’m not sure the `allowed_lateness` is doing anything. It seems to be irrelevant since I have a global window, but confirmation would be nice.
Downstream of this, I write the results of the `CoGroupByKey` into a BigQuery table (after cleaning up formatting, etc.).
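The write itself is unremarkable; roughly this (the table name and the `flatten_row` formatting step are illustrative placeholders):

```python
import apache_beam as beam

(
    joined
    | "FormatRows" >> beam.Map(flatten_row)  # cleanup/formatting (placeholder)
    | "WriteToBQ" >> beam.io.WriteToBigQuery(
        "my-proj:my_dataset.parent_child_joined",  # illustrative table name
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    )
)
```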
It is working, insofar as I am getting all of the data and none of it is lost. But AFAICT, because I have a `GlobalWindow`, the following is happening:
- All data is being kept.
- Every 10 seconds, the parent-child grouping is being done again and re-written to BigQuery.
- But due to the global window, the data is never purged from Dataflow’s memory/storage.
  - (Though I’m actually not sure on this point.)
OK Solution
I would like to create windowing that could do the following (a rough sketch of what I imagine follows this list):
- Keep the parent and children for 1 week.
  - The windowing needs to match, so the parent and the children will both be kept for a week; for the children the week won’t really matter, because only 1 parent will be received.
- Process any child that is received during that parent window via `CoGroupByKey`, accumulated with all the previous children, so that the entire set of children is received into BigQuery.
  - (Or, theoretically, the same thing would happen for any parent that is received during the 1 week of a child. But given how the Pub/Sub messages arrive, that won’t realistically happen.)
- Every 10 seconds, output a new `CoGroupByKey` grouping, if there is any new data.
- Purge the parent and all of the children with the associated match key after the 1-week window.
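The closest thing I can picture is session windows with a one-week gap, but I genuinely don’t know whether this gives the retention and purge behavior above, so treat it as a guess rather than a working answer:

```python
from apache_beam import WindowInto, window
from apache_beam.transforms import trigger

| "Window parent and children"
>> WindowInto(
    window.Sessions(7 * 24 * 60 * 60),  # merge elements arriving within 1 week (event time)
    trigger=trigger.Repeatedly(trigger.AfterProcessingTime(10)),
    accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
)
```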
What is the windowing scheme that I should implement to accomplish this?
Ideal Solution
Even better, if it is possible:
- I actually know when a transaction will have no more events.
  - I.e., if you think of it like a state machine, I have a state that, when reached, says “terminated”.
- Therefore, the best thing to do would be to use the global window with the 10-second trigger, same as I’m doing now, and hang on to everything indefinitely, until the element of the PCollection reaches the terminal state, then remove the element from Dataflow’s memory/storage. (A sketch of what I have in mind follows this list.)
- This is because almost all data reaches that terminal state in about 1 minute; the very late stragglers are a tiny fraction of the data.
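In my head, that would look something like a stateful `DoFn` that buffers per key and clears its state when the terminal event arrives. A sketch, not tested; the `"state"` field and `"terminated"` value are illustrative stand-ins for my actual state machine:

```python
import apache_beam as beam
from apache_beam.coders import PickleCoder
from apache_beam.transforms.userstate import BagStateSpec

class AccumulateUntilTerminal(beam.DoFn):
    """Buffer every event for a key, emit the accumulated set on each
    new event, and drop the buffer once the terminal event arrives."""

    BUFFER = BagStateSpec("buffer", PickleCoder())

    def process(self, element, buffer=beam.DoFn.StateParam(BUFFER)):
        key, event = element
        if event.get("state") == "terminated":  # illustrative terminal marker
            buffer.clear()  # purge this key from Dataflow's state
            return
        buffer.add(event)
        yield key, list(buffer.read())  # everything seen so far for this key

# State is per key and per window, so under a global window the buffer
# lives until we clear it ourselves:
# keyed_events | "Accumulate" >> beam.ParDo(AccumulateUntilTerminal())
```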
I can provide more code if it would be helpful, but this question is already long.