I am trying to use the Vertex AI TensorFlow Profiler to profile my custom training job, following this documentation.
My custom job runs to completion, but I cannot capture a profile in Vertex AI TensorBoard, even though I followed the steps in the “Capture a profiling session” section of the documentation.
When I click “Capture” in the Profile section of Vertex AI TensorBoard after following those steps, I receive the following error:
Failed to capture profile: 401 Client Error: Unauthorized for url: https://….aiplatform-training.googleusercontent.com/profile/capture_profile?service_addr=workerpool0-0&is_tpu_name=false&duration=1000&num_retry=3&worker_list=&host_tracer_level=2&device_tracer_level=1&python_tracer_level=0&delay=0 Invalid OAuth Token . For information on how to setup the profiler, please visit: https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-profiler
The documentation linked above references the roles/storage.admin and roles/aiplatform.user roles. My own service account and the service account used to run the custom training job have each been granted both of these roles.
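For reference, here is a minimal sketch of how such a role check could be done, assuming the project IAM policy has been exported as JSON (for example with `gcloud projects get-iam-policy PROJECT_ID --format=json`). The member names and the example policy below are hypothetical placeholders, not my real accounts, and the check only looks at direct project-level bindings, so roles inherited through groups, folders, or the organization would not show up.

```python
# Sketch only: checks direct role bindings in an exported IAM policy dict.
# The policy shape ({"bindings": [{"role": ..., "members": [...]}]}) matches
# the JSON printed by `gcloud projects get-iam-policy --format=json`.
# All member names below are hypothetical.

REQUIRED_ROLES = {"roles/storage.admin", "roles/aiplatform.user"}


def missing_roles(policy: dict, member: str, required=REQUIRED_ROLES) -> set:
    """Return the required roles that `member` is not directly bound to."""
    granted = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return set(required) - granted


# Hypothetical policy for illustration.
example_policy = {
    "bindings": [
        {
            "role": "roles/storage.admin",
            "members": [
                "serviceAccount:training-sa@example-project.iam.gserviceaccount.com",
                "user:me@example.com",
            ],
        },
        {
            "role": "roles/aiplatform.user",
            "members": [
                "serviceAccount:training-sa@example-project.iam.gserviceaccount.com",
            ],
        },
    ]
}

for m in (
    "serviceAccount:training-sa@example-project.iam.gserviceaccount.com",
    "user:me@example.com",
):
    # An empty set means both roles are bound directly to the member.
    print(m, "is missing:", missing_roles(example_policy, m))
```

An empty result for both principals would match what I described above, which is why I suspect the 401 is not caused by a missing role binding.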
Are there additional permissions required in order to successfully capture a profiling session? Any help/advice on solving this issue would be greatly appreciated!
I tried:
- Checking the Vertex AI platform GitHub repo for similar issues, but I couldn’t find any.
- Changing the service account used to run the training job to match the user (myself) who tries to capture the profiling session in Vertex AI TensorBoard while the training job is running, but this doesn’t seem to be possible on Google Cloud Platform.