Problem Definition:
I’m working on a pipeline where I need to read data from a BigQuery table, transform it, and extract appName from a url attribute. After that, I need to calculate the total duration of each appName per day using two timestamps (session_started and session_closed).
One challenge I’m facing is that the same event can reappear after other events. For example, a “Baseball” event may span the first 5 rows and last 40 minutes in total, then another event occurs, and later “Baseball” appears again. I need to handle such cases correctly when calculating the total duration of each event per day.
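To make the expected result concrete, here is a minimal plain-Python sketch of the aggregation I have in mind. It rests on several assumptions that are not confirmed by my real data: the app name is carried in a hypothetical appName query parameter of the url, the timestamps are ISO 8601 strings, and a session is attributed to the day on which it started (sessions crossing midnight are not split). The function names and sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime
from urllib.parse import urlparse, parse_qs


def extract_app_name(url):
    # Assumption: the app name lives in an "appName" query parameter.
    # The real URL layout may differ.
    values = parse_qs(urlparse(url).query).get("appName")
    return values[0] if values else None


def total_duration_per_app_per_day(rows):
    # Sum session durations in seconds, keyed by (appName, day).
    totals = defaultdict(float)
    for row in rows:
        app_name = extract_app_name(row["url"])
        if app_name is None:
            continue  # skip rows whose URL carries no app name
        started = datetime.fromisoformat(row["session_started"])
        closed = datetime.fromisoformat(row["session_closed"])
        # Assumption: the whole session counts toward the day it started on.
        key = (app_name, started.date().isoformat())
        totals[key] += (closed - started).total_seconds()
    return dict(totals)


# Hypothetical sample: "Baseball" appears, is interrupted by "News",
# then appears again later the same day.
rows = [
    {"url": "https://example.com/play?appName=Baseball",
     "session_started": "2024-01-01T10:00:00",
     "session_closed": "2024-01-01T10:40:00"},
    {"url": "https://example.com/play?appName=News",
     "session_started": "2024-01-01T10:40:00",
     "session_closed": "2024-01-01T10:55:00"},
    {"url": "https://example.com/play?appName=Baseball",
     "session_started": "2024-01-01T10:55:00",
     "session_closed": "2024-01-01T11:05:00"},
]

totals = total_duration_per_app_per_day(rows)
for (app_name, day), seconds in sorted(totals.items()):
    print(app_name, day, seconds)
```

Because the sum is keyed by (appName, day), the order of the rows does not matter, so a repeated “Baseball” block simply adds to the same total. In Beam's Python SDK the same shape would probably be a beam.Map that emits ((appName, day), seconds) followed by beam.CombinePerKey(sum), though I have not verified that against my actual table.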
I’m unsure if Apache Beam is the right tool for this task, or if it’s feasible to implement this logic in a single pipeline. Could someone suggest the best approach or help me with this logic? Any guidance would be appreciated!