I want to implement a hashed cross product transformation like the one Keras uses:
>>> layer = keras.layers.HashedCrossing(num_bins=5, output_mode='one_hot')
>>> feat1 = np.array([1, 5, 2, 1, 4])
>>> feat2 = np.array([2, 9, 42, 37, 8])
>>> layer((feat1, feat2))
<tf.Tensor: shape=(5, 5), dtype=float32, numpy=
array([[0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.],
[1., 0., 0., 0., 0.],
[0., 0., 1., 0., 0.]], dtype=float32)>
>>> layer2 = keras.layers.HashedCrossing(num_bins=5, output_mode='int')
>>> layer2((feat1, feat2))
<tf.Tensor: shape=(5,), dtype=int64, numpy=array([2, 0, 4, 0, 2])>
This layer performs crosses of categorical features using the “hashing trick”. Conceptually, the transformation can be thought of as: hash(concatenate(features)) % num_bins.
I’m struggling to understand the `concatenate(features)` part. Do I have to hash each “pair” of feature values?
In the meantime, I tried this (pairing the two features injectively before taking the modulus, and converting to a tensor for `one_hot`):
>>> cross_product_idx = (feat1 * (feat2.max() + 1) + feat2) % num_bins
>>> cross_product = nn.functional.one_hot(torch.as_tensor(cross_product_idx), num_bins)
It works, but skipping the hash function means the values are not mixed before binning, which can skew how the crossed pairs are distributed across the bins.
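To make the bucketing closer to the “hashing trick”, I also sketched a version that hashes the concatenation of each pair of values before taking the modulus. This is only a sketch of my understanding: I use `zlib.crc32` on the string `"a_b"` as the hash, whereas Keras presumably uses a different hash internally, so the exact bin assignments won’t match the `HashedCrossing` output above.

```python
import zlib

import numpy as np
import torch
import torch.nn.functional as F


def hashed_cross(feat1, feat2, num_bins):
    # hash(concatenate(features)) % num_bins, per pair of values.
    # zlib.crc32 is a stand-in hash; Keras likely uses a different
    # one internally, so bins will differ from the Keras example.
    idx = [zlib.crc32(f"{a}_{b}".encode()) % num_bins
           for a, b in zip(feat1, feat2)]
    return torch.tensor(idx, dtype=torch.int64)


feat1 = np.array([1, 5, 2, 1, 4])
feat2 = np.array([2, 9, 42, 37, 8])
num_bins = 5

idx = hashed_cross(feat1, feat2, num_bins)       # shape (5,), values in [0, num_bins)
one_hot = F.one_hot(idx, num_bins).float()       # shape (5, num_bins)
```

Is this per-pair hashing what the docs mean by `concatenate(features)`?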