Write out the peak memory utilization of a PySpark job on EMR to a file
We run a lot of PySpark jobs on EMR. The pipeline is the same for every run, but the inputs can wildly change the peak memory utilization, and that utilization has been growing over time. I would like to automatically write out the peak memory utilization of each step submitted to the EMR cluster. If it matters, we are running in cluster mode with YARN as the cluster manager, and the jobs are submitted as Docker containers.
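
To make the goal concrete, here is a rough, untested sketch of the kind of instrumentation I have in mind: a small script running alongside the step that polls Spark's monitoring REST API (`/api/v1/applications` and `.../executors`) and writes the largest executor memory figure it has seen to a file. The driver/proxy URL, the output path, and the `peakMemoryMetrics` field (which I believe was added to the executors endpoint in Spark 3.0) are assumptions I would need to verify against our cluster; in YARN cluster mode the UI is normally reached through the ResourceManager proxy rather than port 4040 directly.

```python
# Sketch only: poll the Spark driver's REST API while the step runs and
# record the largest executor memory value seen. URLs, output path, and the
# peakMemoryMetrics field are assumptions to verify for our EMR/YARN setup.
import json
import time
import urllib.request

DRIVER_UI = "http://<driver-or-rm-proxy>:4040"   # placeholder, depends on the cluster
OUTPUT_PATH = "/tmp/peak_memory.json"            # hypothetical output location
POLL_SECONDS = 30


def fetch_json(url):
    """GET a Spark REST API endpoint and parse the JSON response."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def poll_peak_memory():
    peak_bytes = 0
    while True:
        try:
            apps = fetch_json(f"{DRIVER_UI}/api/v1/applications")
            for app in apps:
                executors = fetch_json(
                    f"{DRIVER_UI}/api/v1/applications/{app['id']}/executors")
                for ex in executors:
                    # memoryUsed is storage memory currently in use;
                    # peakMemoryMetrics (Spark 3.0+, if present) reports
                    # per-executor peak JVM heap usage.
                    candidates = [ex.get("memoryUsed", 0)]
                    peak_metrics = ex.get("peakMemoryMetrics") or {}
                    candidates.append(peak_metrics.get("JVMHeapMemory", 0))
                    peak_bytes = max(peak_bytes, max(candidates))
            # Rewrite the file on every poll so the last value survives
            # the driver UI disappearing when the step finishes.
            with open(OUTPUT_PATH, "w") as f:
                json.dump({"peak_executor_memory_bytes": peak_bytes}, f)
        except Exception:
            # The driver UI goes away when the application ends; keep the
            # last value that was written and stop polling.
            break
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    poll_peak_memory()
```

Is polling like this a reasonable approach, or is there a better-supported way (a listener, Ganglia/CloudWatch, the YARN ResourceManager API, event logs) to capture per-step peak memory on EMR?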