I’m constructing a big df out of multiple csv files. The only differences between those files should be slightly differing order of the col names. I stuck them in a recursive loop and put them into dfs. After some basic clean-up operations, I concatenated the multiple dfs and called pandas’s to_csv method. The result is a df with apparently 45 columns, of which all are present in the data, yet two are missing both their index and their name when I call df.info(). It seems the two problematic columns are the only ones put into quotation marks by pandas, those being the last col “co2-ausstoß” (this one I can imagine being because of the weird character, though I’m not sure) and the col “additional”, which I can refer to in my code with “additionalr” (while having another additional-column, without quotation marks, also present in df).
My code for reference (I omitted methods I ran on specific columns not referring to “co2-ausstoß” and “additional”):
Input:
autohero_data_list = []
for entry in files:
df = pd.read_csv(entry, lineterminator='n', sep=";", header=0)
df = df[df.listing_price.notnull()]
df = df[df.highlights.notnull()]
df.columns = df.columns.str.lower().str.replace(' ','_')
df = df.reset_index(drop=True)
autohero_data_list.append(df)
df_clean_autohero = pd.concat(autohero_data_list)
path = "./data/data_cleaned"
df_clean_autohero.to_csv(os.path.join(path,r"autohero_data_clean"), encoding='utf-8', index=False)
df_clean_autohero.info()
Out:
<class 'pandas.core.frame.DataFrame'>
Index: 10173 entries, 0 to 108
Data columns (total 45 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 url 10173 non-null object
1 scrape_date 10173 non-null object
2 make 10173 non-null object
3 model 10173 non-null object
4 listing_price 10173 non-null float64
.
.
.
35 highlights 10173 non-null object
36 comfort 10173 non-null object
37 multimedia 10170 non-null object
38 light-and-sight 10165 non-null object
39 security 10165 non-null object
6837 non-null object
41 co2-effizienzklasse 10173 non-null object
42 additional 3330 non-null object
43 verbrauch 1 non-null object
119 non-null object
dtypes: float64(4), object(41)
memory usage: 3.6+ MB
My issue is then of course at #40 and #44, and I don’t understand what’s wrong here. Also, you can see that there is another “additional” column in the df, as mentioned – for some reason, the “additional” columns, which contain a list of values, were not merged within the loop.
For the two columns containing “additional” in the col name, I tried finding out whether they were identical or possibly inverted, by calling df_new["additionalr"].isna().sum()
and df_new["additional"].isna().sum()
, yielding 3336
and 6843
as results, respectively. Finding the len(df_new)
and summing up the two results from the previous methods reveals that the result is off by 6 entries – meaning that the two columns with “additional” in the name are not completely just inverted, though possibly close.
This is my first question here, so I hope I gave all relevant info, otherwise let me know what else I can provide. Any input or suggestion would be highly appreciated!
user24923167 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.