After scraping these tables into a lakehouse, I am having a hard time changing the concatenated structure:
table before
Into an Actual table with my values:
table after
It appears that each intended row is concatenated into a single cell.
ex: cell1 = ["$) October 1, 2021", "34,969 to 60,431", "64,042", "66,432", "68,824", "71,212", "73,607", "75,500"]
should become: $) October 1, 2021 | 34,969 to 60,431 | 64,042 | 66,432 | 68,824 | 71,212 | 73,607 | 75,500
and cell 2 should become the second row, and so on.
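To make the target layout concrete, here is the first sample cell written out in plain Python (the values are copied from the example above; the " | " separator is only for display):

```python
# The first scraped cell: one list holding what should be an entire output row.
cell1 = ["$) October 1, 2021", "34,969 to 60,431", "64,042",
         "66,432", "68,824", "71,212", "73,607", "75,500"]

# Each element of the list should become one column value in that row.
pretty = " | ".join(cell1)
print(pretty)
```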
The code I am using after loading table_before into a DataFrame (df_1) is:
from pyspark.sql import Row

# Get the column names from df_1
df_1_columns = df_1.columns

# Get the first row of df_1 (assuming this row contains the data to be used for rows)
row_data = df_1.collect()[0]  # Get the first row

# Initialize an empty list to store the new rows
new_rows = []

# Iterate through each cell in the row
for cell in row_data:
    if isinstance(cell, str):  # Ensure the cell is a string
        # Assuming the cell contains a comma-separated list of values
        list_data = cell.split(',')  # Split the string by commas
        # Create a new Row object from the elements of the list
        new_row = Row(*list_data)
        new_rows.append(new_row)

# Convert the new rows to a DataFrame
new_df = spark.createDataFrame(new_rows)

# Rename new_df's columns using df_1's columns
new_df_renamed = new_df.toDF(*df_1_columns)

# Show the transformed DataFrame
new_df_renamed.show(truncate=False)
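Since the error message suggests the cells are arrays rather than strings, I assume the loop body would need to handle both types. In plain Python terms (a sketch with illustrative data, no Spark required):

```python
# Sketch: build one output row per cell, accepting both str and list cells.
# (Plain-Python stand-in for the Row-building loop; the sample data is made up.)
row_data = [
    "a,b,c",          # a cell that really is a comma-joined string
    ["x", "y", "z"],  # a cell that is already a list (array<string>)
]

new_rows = []
for cell in row_data:
    if isinstance(cell, str):
        new_rows.append(cell.split(','))  # split string cells on commas
    elif isinstance(cell, list):
        new_rows.append(list(cell))       # use list cells as-is

print(new_rows)
```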
but I am getting the following error:
ValueError: can not infer schema from empty dataset
The schema also shows that each cell is not a string but an array<string>.
I have tried multiple things and can't get to the right solution yet.
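My understanding of why the error appears, reproduced in plain Python (assuming the cells really come back as array<string>, i.e. Python lists):

```python
# When the cells are lists (array<string>) instead of strings,
# the isinstance(cell, str) guard filters every cell out.
row_data = [["$) October 1, 2021", "34,969 to 60,431"], ["64,042", "66,432"]]

new_rows = [cell.split(',') for cell in row_data if isinstance(cell, str)]
print(new_rows)  # [] -- nothing passes the guard

# spark.createDataFrame(new_rows) then receives an empty list, which is
# exactly the "can not infer schema from empty dataset" situation.
```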