df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferSchema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
# Rename each field in the inferred schema, then reload the file with that schema
for i, k in enumerate(oldSchema.fields):
    k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)
I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful with a one-line assignment: df.columns = new_column_name_list.
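For reference, the pandas pattern I mean looks roughly like this (a minimal sketch; the file name and column names are made up):

import pandas as pd

df = pd.read_csv("data.txt", sep="\t", header=None)  # read a headerless, tab-delimited file
df.columns = ["id", "name", "score"]                 # rename every column in one assignment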
But that doesn't work on a PySpark DataFrame, so I ended up with the code above.
This essentially defines the dataframe twice: it first infers the schema, then renames the fields in that schema, and finally loads the file again with the updated schema.
Is there a better and more efficient way to do this, like we do in pandas?
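Ideally I'd like a single built-in call, something like this hypothetical one-liner (the method name is made up purely to illustrate what I'm after; it is not a real PySpark API):

df = df.rename_all_columns(new_column_name_list)  # hypothetical method, for illustration only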