I have an ONNX model generated from a Hugging Face transformer.
I’m trying to replicate the behavior of the model using the Transformers library with PyTorch, and it is working more or less fine.
But now I’m trying to reduce the model size, using ONNX dynamic quantization.
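For context, the ONNX side is quantized roughly like this (a minimal sketch; the file names are placeholders, and I’m assuming the standard onnxruntime.quantization API):

from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the exported ONNX model's weights to INT8 (paths are placeholders)
quantize_dynamic(
    model_input="model.onnx",
    model_output="model-quant.onnx",
    weight_type=QuantType.QInt8,
)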
So I’m also adding dynamic quantization to the parallel PyTorch code, using:
import torch
from transformers import AutoModel, AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)
hf_model = AutoModel.from_pretrained(HF_MODEL)

if ONNX_EXPORT_QUANTIZE:
    hf_device = torch.device("cpu")  # the default PyTorch quantization backend does not support GPU
    hf_model = torch.quantization.quantize_dynamic(
        hf_model,                                # the model to quantize
        {torch.nn.Linear, torch.nn.LayerNorm},   # specify layers to quantize
        dtype=torch.qint8,                       # use INT8 for weights
    )
    hf_model = hf_model.to(hf_device)
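For reference, this is roughly how I’m comparing the two models (a minimal sketch; the ONNX file name and the sample sentence are placeholders):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model-quant.onnx", providers=["CPUExecutionProvider"])

# Run the quantized PyTorch model on a sample input
inputs = hf_tokenizer("This is a test sentence.", return_tensors="pt")
with torch.no_grad():
    torch_out = hf_model(**inputs).last_hidden_state.numpy()

# Feed the same tokens to ONNX Runtime and compare the raw outputs
onnx_inputs = {name: tensor.numpy() for name, tensor in inputs.items()}
onnx_out = session.run(None, onnx_inputs)[0]
print("max abs diff:", np.abs(torch_out - onnx_out).max())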
I’m close, but the outputs don’t match completely. I have checked with Netron, and I see that ONNX dynamic quantization adds quantization after a lot of operators (Mul, Reshape, …) that are not layers in PyTorch. Which layers do I have to add in torch.quantization.quantize_dynamic to reproduce the ONNX behavior?