I have a challenge today:
Given a list of S3 paths, I want to split each path and build a DataFrame with one column holding the full path and a second column holding just the name of the folder.
My list has the following content:
raw/ingest_date=20240918/eventos/
raw/ingest_date=20240918/llamadas/
raw/ingest_date=20240918/campanhas/
raw/ingest_date=20240918/miembros/
raw/ingest_date=20240918/objetivos/
I tried this code:
new_dict = []
for folder in subfolders:
    new_dict.append(folder)
    name = folder.split("/", -1)
    new_dict.append(name[2])
    #print(name)
print(type(new_dict))
for elem in new_dict:
    print(elem)
df = spark.createDataFrame(new_dict, ["s3_prefix", "table_name"])
df.show()
The loop prints a flat list like this:
raw/ingest_date=20240918/eventos/
eventos
raw/ingest_date=20240918/llamadas/
llamadas
raw/ingest_date=20240918/campanhas/
campanhas
...
...
But when I try to create and show my DataFrame, I get this error:
TypeError: Can not infer schema for type: <class 'str'>
The idea is to have a DataFrame like this:
s3_prefix                           | table_name
------------------------------------------------------
raw/ingest_date=20240918/eventos/   | eventos
raw/ingest_date=20240918/llamadas/  | llamadas
raw/ingest_date=20240918/campanhas/ | campanhas
raw/ingest_date=20240918/miembros/  | miembros
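A minimal sketch of one way to get rows in that shape, assuming the cause of the TypeError is that `createDataFrame` receives a flat list of strings rather than one tuple per row. The helper name `to_rows` is hypothetical, the sample list is copied from above, and the Spark call is left as a comment because it needs an active `spark` session:

```python
# Sample prefixes copied from the list above; `subfolders` is the name used in my code.
subfolders = [
    "raw/ingest_date=20240918/eventos/",
    "raw/ingest_date=20240918/llamadas/",
    "raw/ingest_date=20240918/campanhas/",
    "raw/ingest_date=20240918/miembros/",
    "raw/ingest_date=20240918/objetivos/",
]


def to_rows(prefixes):
    """Return one (s3_prefix, table_name) tuple per prefix (hypothetical helper)."""
    rows = []
    for prefix in prefixes:
        # Drop the trailing slash, then keep the last path component.
        table_name = prefix.rstrip("/").split("/")[-1]
        rows.append((prefix, table_name))
    return rows


rows = to_rows(subfolders)

# Assumption: an active SparkSession named `spark`, as in my code above.
# Each tuple becomes one row, so the two column names line up with the data:
# df = spark.createDataFrame(rows, ["s3_prefix", "table_name"])
# df.show(truncate=False)
```

Taking the last component after stripping the trailing slash, instead of `name[2]`, is a design choice so the sketch does not depend on the prefix always having exactly three levels.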
Can somebody give me a hand resolving this?
Regards