I have a PySpark DataFrame with two columns: one is a float and the other is an array.
I know that the length of the array in each row equals the number of rows in the DataFrame.
I want to add a new column whose value, for each row, is the dot product of that row's array with the entire value column (all rows, in order).
For example, for the following DataFrame:
+------------------------------------------------------------+-----+
|weights |value|
+------------------------------------------------------------+-----+
|[0.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]|34 |
|[5.0, 0.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0, 4.0]|50 |
|[4.0, 5.0, 0.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0]|56 |
|[3.0, 4.0, 5.0, 0.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0]|45 |
|[2.0, 3.0, 4.0, 5.0, 0.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0]|34 |
|[1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]|36 |
|[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 5.0, 4.0, 3.0, 2.0, 1.0]|45 |
|[1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 5.0, 4.0, 3.0, 2.0]|50 |
|[2.0, 1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 5.0, 4.0, 3.0]|57 |
|[3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 5.0, 4.0]|39 |
|[4.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 5.0]|48 |
|[5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0]|39 |
+------------------------------------------------------------+-----+
I want to add a new column ‘result’ and the value of each row will be:
numpy.dot(row['weights'], [34, 50, 56, 45, 34, 36, 45, 50, 57, 39, 48, 39])
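To make the expected output concrete, here is a minimal plain-Python sketch of the calculation for the first row only. It does not use Spark at all; the helper name `dot` and the variable names are just mine for illustration, and it assumes the value column is taken in the row order shown above.

```python
# Plain-Python illustration of the desired 'result' for the first row.
# No Spark involved; this only shows the arithmetic I am after.

def dot(weights, values):
    """Return the dot product of two equal-length sequences."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * v for w, v in zip(weights, values))

# The whole 'value' column, in the row order shown in the table.
values = [34, 50, 56, 45, 34, 36, 45, 50, 57, 39, 48, 39]

# The 'weights' array of the first row.
first_row_weights = [0.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

first_result = dot(first_row_weights, values)
print(first_result)  # 1381.0
```

So the first row of the new column should contain 1381.0, and every other row is computed the same way using its own weights array.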
Thanks.