I am very new to Spark, so bear with me. I am trying to hash feature vectors generated by CountVectorizer. For example, with a hash size of 50:
+---+--------------------+-----------------------------+
|id |features |featuresHashed |
+---+--------------------+-----------------------------+
|0 |(3,[],[]) |(50,[0,32,41],[0.0,0.0,0.0]) |
|2 |(3,[2],[1.0]) |(50,[0,32,41],[0.0,1.0,0.0]) |
|3 |(3,[0,2],[1.0,-1.0])|(50,[0,32,41],[1.0,-1.0,0.0])|
+---+--------------------+-----------------------------+
Here featuresHashed is the new column. I am having a hard time doing this without a UDF that contains a nested for loop (or holds all of the logic in general), which makes the code very slow on large datasets.
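For reference, here is a minimal sketch of the kind of UDF I mean (the names hash_features and HASH_SIZE, and the use of Python's built-in hash, are placeholders; my real hashing logic is similar in spirit):

```python
from pyspark.sql.functions import udf
from pyspark.ml.linalg import SparseVector, VectorUDT

HASH_SIZE = 50  # target dimensionality of the hashed space

@udf(returnType=VectorUDT())
def hash_features(v):
    # Loop over every index of the original vector space, map it into
    # the hashed space, and keep explicit zeros for absent indices
    # (matching the featuresHashed column shown above).
    values = {}
    for i in range(v.size):
        j = hash(i) % HASH_SIZE  # placeholder hash function
        values[j] = values.get(j, 0.0) + v[i]
    return SparseVector(HASH_SIZE, values)

df = df.withColumn("featuresHashed", hash_features("features"))
```

Because this executes row by row in Python, Spark cannot optimize or vectorize it, which is why it gets so slow as the data grows.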