I am training some neural networks in PyTorch to use as an embedded surrogate model. Since I am testing various architectures, I want to compare the accuracy of each one, but I am also interested in measuring the computational time of a single forward pass as accurately as possible.
Below is the structure I am currently using, but I wonder if it can be done better:
import torch
from time import perf_counter

x = ...      # input tensor (n_samples, n_input_features)
model = ...  # trained pytorch model

n_samples = x.shape[0]  # number of single-sample forward passes to time
times = []              # empty list to hold evaluation times

# Warm up pytorch:
_ = model(x)

# Timing run:
for i in range(n_samples):
    start = perf_counter()
    with torch.no_grad():
        y_hat = model(x[i])
    end = perf_counter()
    times.append(end - start)

avg_time = sum(times) / n_samples  # average time per run
The reason I evaluate each sample individually in a loop is that the embedded surrogate model will receive a single set of inputs at a time. This seems closer to my use case, and it also avoids the parallel computation that CUDA or MPS would apply across the whole set of samples in x.
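For context, this is the kind of variant I have been experimenting with: everything moved onto the CPU and PyTorch restricted to a single intra-op thread, using torch.set_num_threads and torch.inference_mode (I am not certain this is the right approach, which is partly why I am asking). It reuses model, x, and n_samples from the snippet above:

torch.set_num_threads(1)            # restrict intra-op parallelism to one thread
cpu_model = model.to("cpu").eval()  # eval() in case there are dropout/batch norm layers
cpu_x = x.to("cpu")

cpu_times = []
with torch.inference_mode():        # like no_grad, but also skips autograd bookkeeping
    _ = cpu_model(cpu_x[0])         # warm up with a single sample, same shape as the timed calls
    for i in range(n_samples):
        start = perf_counter()
        _ = cpu_model(cpu_x[i])
        end = perf_counter()
        cpu_times.append(end - start)
avg_cpu_time = sum(cpu_times) / n_samples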
I have a few questions regarding this:

- Can the current structure of my code be improved to maximize the accuracy of the timings?
- If the model and tensors have their device set to MPS, is there a benefit to moving them to the CPU when measuring computation time?
- Wouldn't it make sense to confine the evaluation of the model to a specific CPU thread to maximize the consistency of my readings? Is that even possible?
- Any other thoughts or suggestions you may have on this?
Thanks in advance for the help!