When mapping our 6-column PySpark RDD to a 4-tuple, we get a list index out of range error for any index other than 0; index 0 returns the expected result.
rdd3 = sc.textFile('hdfs://path/data.csv')
header3 = rdd3.first()
rdd3 = (rdd3.filter(lambda line: line != header3)
            .map(lambda row: row.split(","))
            .map(lambda row: (row[0], row[1], row[3], row[5]))
            .collect())
For example, if we keep only row[0] there is no error, but as soon as we include row[1] it throws a list index out of range exception. It's as if row.split(",") does not return a 6-element list, which it should.
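To illustrate what we suspect is happening, here is a plain-Python sketch (with made-up sample data, not our actual file): a blank or malformed line splits into fewer than 6 fields, so index 0 still works but any higher index fails.

```python
# Hypothetical rows: one well-formed 6-field line and one blank line.
rows = ["a,b,c,d,e,f", ""]
split_rows = [row.split(",") for row in rows]

# "".split(",") returns [""] -- a single-element list, not an empty one.
print([len(r) for r in split_rows])  # [6, 1]

# split_rows[1][0] is "" and works fine; split_rows[1][1] raises IndexError,
# matching the behavior we see: row[0] succeeds, row[1] blows up.
```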
Any ideas?