Given input dataframe with structure:
| machine_id | process_id | activity_type | timestamp |
| ---------- | ---------- | ------------- | --------- |
| 0 | 0 | start | 0.712 |
| 0 | 0 | end | 1.52 |
| 0 | 1 | start | 3.14 |
| 0 | 1 | end | 4.12 |
| 1 | 0 | start | 0.55 |
| 1 | 0 | end | 1.55 |
The task is to calculate average time of process per machine.
The solution is to calculate difference between end and start timestamp for each process, then define average for this difference per each machine_id.
This is the working code:
machines.as("m1")
.join(machines.as("m2"), $"m1.machine_id" === $"m2.machine_id"
&& $"m1.process_id" === $"m2.process_id" && $"m2.timestamp" > $"m1.timestamp")
.withColumn("dif", $"m2.timestamp" - $"m1.timestamp")
.groupBy($"m1.machine_id").avg("dif")
.show()
It produced the output:
+----------+--------+
|machine_id|avg(sum)|
+----------+--------+
| 0| 0.894|
| 1| 0.995|
| 2| 1.456|
+----------+--------+
It confuses me that I was forced to create addition column “dif”. This column contain the result of difference $”m2.timestamp” – $”m1.timestamp”.
I tried to pass this difference as argument of avg-function:
.avg($"m2.timestamp" - $"m1.timestamp")
But this does not work. That’s why I used
.withColumn("dif", $"m2.timestamp" - $"m1.timestamp")
In view of that additional column entails extra memory utilization, is there any other way to calculate .avg() of expression from self-join without additional columns creation?