I am trying to figure out how to use the code I have written, specifically the loop section of the code to return all unique values in this excel file, but performing this calculation bu subtracting the years 2018 and 2021, just leaving the difference value next to the region and country columns.
I know that I have to use a loop in order to do this, but I am unsure of how best to go about it, I have tried different approaches, such as using a nested loop, but it just produces either duplicate or wrong values, and in some cases, the calculation doesn’t even take place.
Here is a example of the data I am working with:
df = pd.DataFrame({'Country': ['Austria', 'Austria', 'Ireland', 'Ireland'],
'REGION_2D': ['AT10', 'AT10', '4', '4'],
'value': [850.01874, 869.550869, 545.4036925, 560.695257],
'YEAR': [2018, 2021, 2018, 2021]})
Here is my code:
import pandas as pd
import openpyxl as op
df = pd.read_excel(r"C:UsersblakecarPycharmProjectsLFS DataMock File Population.xlsx", sheet_name='Sheet1')
df["COUNTRY"] = df["COUNTRY"].astype(pd.StringDtype())
df["REGION_2D"] = df["REGION_2D"].astype(pd.StringDtype())
df["Key"] = df["Key"].astype(pd.StringDtype())
df["Variable"] = df["Variable"].astype(pd.StringDtype())
df["variable_category"] = df["variable_category"].astype(pd.StringDtype())
df["YEAR"] = df["YEAR"].astype(pd.StringDtype())
print(df.dtypes)
#output dataframe to print the exel file to
output = pd.DataFrame()
output["COUNTRY"] = ""
output['REGION_2D'] = " "
output["Variable"] = ''
output['Key'] = ''
output["AC_Change"] = ''
def func1(geog):
print("------------")
print(geog)
print("------------")
filtered_df = df[df["COUNTRY"].isin([geog])]
filtered_df = filtered_df[filtered_df["Variable"].isin(["people_15_64_W"])]
#filtered_df = filtered_df[filtered_df["variable_category"].isin(['None'])]
#filter through 2018 and summed it up
filtered_2018 = filtered_df[filtered_df["YEAR"].isin(["2018"])]
total_2018 = filtered_2018['value'].sum()
#print(total_2018)
#filter through 2021 and summed it up
filtered_2021 = filtered_df[filtered_df["YEAR"].isin(["2021"])]
total_2021 = filtered_2021['value'].sum()
#print(total_2021)
#AC change calculation
total = total_2018 - total_2021
#print(total)
return total
#iterates through unique values- calls function should print to excel
geo = df['REGION_2D'].unique()
for geog in geo:
funcValue1 = func1(geog)
output.loc[len(output)] = {'REGION_2D': geog,'AC_Change': funcValue1}
#print(output)
output.to_excel('Mock File2.xlsx', index=False)
This is what I was hoping the expected output to look like:
| COUNTRY | REGION_2D | Value |
| -------- | -------- |
| Austria | 1 |- 19.5321
| Ireland | 4 |- 15.2916
Here’s one approach:
import pandas as pd
out = (
df.pivot(index=['Country', 'REGION_2D'], columns='YEAR', values='value')
.diff(periods=-1, axis=1)
.rename(columns={2018: 'Value'})
.dropna(axis=1)
.reset_index()
.rename_axis(columns=None)
)
Output:
Country REGION_2D Value
0 Austria AT10 -19.532129
1 Ireland 4 -15.291564
Explanation
- Use
df.pivot
to reshape data. - Get
df.diff
onaxis=1
withperiods=-1
. - Values will now be in column 2018: apply
df.rename
to change label to “Value”. - Use
df.dropna
to get rid of the empty column (2021). - Finally, apply
df.reset_index
to reset the index, andrename_axis
to get rid of the columns name (i.e. YEAR).