I want to use label smoothing regularization for my NMT task. As suggested, the usual way to do it is as follows:
import tensorflow as tf
from tensorflow.keras.losses import CategoricalCrossentropy

loss_object = CategoricalCrossentropy(from_logits=True, reduction='none')

def masked_loss_function(y_true, pred, smoothing_factor=0.1):
    # mask out padding positions
    mask = tf.logical_not(tf.equal(y_true, fa_tokenizer.pad_token_id))
    # smoothed one-hot targets of shape [batch, time, vocab]
    y_hot = (1 - smoothing_factor) * tf.one_hot(y_true, depth=VOCAB_TARG_SIZE) + (smoothing_factor / VOCAB_TARG_SIZE)
    loss = loss_object(y_hot, pred)
    mask = tf.cast(mask, loss.dtype)
    loss *= mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)
The problem is that I am running this on an RTX 2060 laptop GPU with 6 GB of VRAM, and one-hot encoding the targets blows the tensor up to a shape like [64, 30, 25000], which is expensive (I am using float32).
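Just that one-hot tensor alone is on the order of
$
64 \times 30 \times 25000 \times 4\ \text{bytes} \approx 192\ \text{MB}
$
before gradients and other intermediate buffers are counted.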
I was looking for an alternative way to carry out these computations sparsely. Can anybody help me?
I tried selecting from y_pred the probabilities at the indices given by y_true (call them chosen), so that the smoothed loss can be computed without the one-hot targets as
$
-\sum_i \left[ (1 - \text{smoothing\_factor}) \cdot \log(\text{chosen}_i) + \frac{\text{smoothing\_factor}}{\text{num\_classes}} \sum_k \log(\text{y\_pred}_{i,k}) \right]
$
where the inner sum runs over all classes.
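If that decomposition is right, it should be possible to implement it directly on the logits, without ever materialising the one-hot targets. A minimal sketch of what I have in mind (sparse_smoothed_loss and pad_id are my own names; it assumes pred holds raw logits, as with from_logits=True above):

import tensorflow as tf

def sparse_smoothed_loss(y_true, logits, smoothing_factor=0.1, pad_id=0):
    # first term: -log p(true token), computed sparsely and numerically stably
    nll = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=logits)  # [batch, time]
    # second term: sum over all classes of log p(k); same size as the logits, no one-hot tensor
    log_probs = tf.nn.log_softmax(logits, axis=-1)       # [batch, time, vocab]
    all_log_prob = tf.reduce_sum(log_probs, axis=-1)     # [batch, time]
    num_classes = tf.cast(tf.shape(logits)[-1], logits.dtype)
    # label-smoothed cross-entropy:
    #   (1 - eps) * (-log p_y) - (eps / K) * sum_k log p_k
    loss = (1.0 - smoothing_factor) * nll - (smoothing_factor / num_classes) * all_log_prob
    # mask out padding positions, as in the dense version
    mask = tf.cast(tf.not_equal(y_true, pad_id), loss.dtype)
    loss *= mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)

The only [batch, time, vocab] tensor here is log_probs, which is the same size as the logits the model already outputs, so no extra one-hot tensor is allocated. It would be called the same way as masked_loss_function, e.g. sparse_smoothed_loss(y_true, pred, pad_id=fa_tokenizer.pad_token_id). Does this look correct?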