import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
vector_dict = {}

# Build a dict mapping each row's first tab-separated field to its vector
d2v_rdd = spark.sparkContext.textFile("")  # path left blank in the original
for row in d2v_rdd.collect():  # collect() pulls every row back to the driver
    row_elements = row.split("\t")
    vector_dict[row_elements[0]] = np.array(row_elements[1:])

# Getting the dim features from the products file
products_rdd = spark.sparkContext.textFile("")  # path left blank in the original
for row in products_rdd.collect():
    row_elements = row.split("\t")
The dataset has 431,907 rows.
I have the above code implemented in three different forms:
- plain Python, reading the file with a with open("") context manager (sketched below)
- reading the file into a Spark DataFrame with spark.read.csv (sketched below)
- the RDD approach shown above
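For reference, here are minimal sketches of the first two forms. The paths are left blank as in the original, spark is the session created above, and the tab separator and column layout are assumptions on my part:

import numpy as np

# Form 1: plain Python context manager (path left blank, as in the original)
vector_dict = {}
with open("") as f:
    for line in f:
        row_elements = line.rstrip("\n").split("\t")
        vector_dict[row_elements[0]] = np.array(row_elements[1:])

# Form 2: Spark DataFrame (sep="\t" is an assumption about the file format)
df = spark.read.csv("", sep="\t")
for row in df.collect():  # collect() still materializes every row on the driver
    vector_dict[row[0]] = np.array(row[1:])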
I was expecting the Spark DataFrame version to be the fastest, but it turns out that the most efficient method is the plain Python context manager (with open("")).
Any reason why this might be happening?
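Here is a minimal sketch of how I would time each variant; the timed helper and the load_* names are just placeholders standing in for the three implementations above:

import time

def timed(label, fn):
    # fn is a zero-argument callable that runs one of the three loaders
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

timed("with open()", load_with_open)          # placeholder callables for
timed("spark.read.csv", load_with_dataframe)  # the three implementations
timed("RDD + collect()", load_with_rdd)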