I have a Spark job that is supposed to first infer the schema and then do the real “job”. To infer the schema we use:
sqlContext.sparkSession.read.json(df.select($"columns").as[String])
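For context, written out as a self-contained helper the inference step looks roughly like this (the helper name inferJsonSchema and the assumption that the "columns" column holds raw JSON strings are mine, just for illustration):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

// Read the raw JSON strings from the "columns" column and let Spark infer a
// schema from them. The read.json call launches its own Spark job, and that
// is the job that occasionally gets cancelled.
def inferJsonSchema(spark: SparkSession, df: DataFrame): StructType = {
  import spark.implicits._
  spark.read.json(df.select($"columns").as[String]).schema
}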
Most of the time it works fine, but occasionally the job gets cancelled, which makes this call fail.
I don’t have access to the actual schema, but I can say this about it:
- This happens in many situations, with completely different schemas
- It fails randomly and not very often; retrying the same table works (by “retrying” I mean something like the sketch after this list)
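The retry is nothing clever, just wrapping the call above in something like this (an illustrative sketch, not our exact code; inferJsonSchema is the helper sketched earlier):

import scala.util.{Failure, Success, Try}

// Naive retry: re-run the block up to maxAttempts times, rethrowing the last error.
def withRetries[T](maxAttempts: Int)(block: => T): T = {
  Try(block) match {
    case Success(result) => result
    case Failure(_) if maxAttempts > 1 => withRetries(maxAttempts - 1)(block)
    case Failure(e) => throw e
  }
}

// Typically the second attempt succeeds for the same table.
val schema = withRetries(3)(inferJsonSchema(spark, df))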
The logs are very sparse. They look like this (all from the driver):
...
Submitting 1 missing tasks from ResultStage 805119
Adding task set 805119.0 with 1 tasks resource profile 0
Asked to cancel job 225796
(More logs related to cancellation and eventual job failure)
As far as I can see, no reason is provided for the cancellation. I also don’t think it was cancelled manually; this happens too often for that.
I don’t think an executor failed, firstly because it doesn’t look like any executor picked up the task (I would expect a “Got assigned task” log line from an executor), and secondly because it doesn’t look like any executor died.
I also don’t think it’s any sort of timeout: these log lines are almost instantaneous, the job starts only a few seconds before them, and less than a second passes between the task-set creation and the cancellation.
Another point is that the actual job, once the schema is inferred, doesn’t get cancelled, or at least not as frequently, but maybe I’m just mis-filtering the logs.
What could be causing this cancellation?
Which steps can I take to diagnose this?
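One step I’m considering, to confirm whether the scheduler attaches any reason or job group to these cancellations, is registering a SparkListener along these lines (a rough sketch; the class name and log messages are just for illustration):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Log, for every job, which job group it belongs to and how it ended. If the
// cancellation comes from a cancelJobGroup() call somewhere, the group id
// logged at job start should hint at who issued it.
class CancellationLogger extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val group = Option(jobStart.properties).map(_.getProperty("spark.jobGroup.id")).orNull
    println(s"Job ${jobStart.jobId} started, jobGroup=$group")
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    // jobResult is JobSucceeded or a failure that carries the exception
    println(s"Job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
  }
}

sqlContext.sparkSession.sparkContext.addSparkListener(new CancellationLogger)

I’m not sure this will actually surface who initiated the cancellation, so pointers on what else to log would help.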