I have an issue of performing .diff() and .pct_change() calls on rows. The original table has 51 countries so I will use an example of three countries here as follows:
import pandas as pd
data = {'country' : ['Andorra', 'Andorra', 'Andorra', 'Andorra',
'Italy', 'Italy', 'Italy', 'Italy',
'Slovakia', 'Slovakia', 'Slovakia', 'Slovakia'],
'year': [1987, 1988, 1989, 1990,
1987, 1988, 1989, 1990,
1987, 1988, 1989, 1990],
'population': [10, 12, 17, 21,
34, 27, 22, 26,
18, 23, 45, 12]}
df = pd.DataFrame(data)
which results in the following dataframe df
country | year | population | |
---|---|---|---|
0 | Andorra | 1987 | 10 |
1 | Andorra | 1988 | 12 |
2 | Andorra | 1989 | 17 |
3 | Andorra | 1990 | 21 |
4 | Italy | 1987 | 34 |
5 | Italy | 1988 | 27 |
6 | Italy | 1989 | 22 |
7 | Italy | 1990 | 26 |
8 | Slovakia | 1987 | 18 |
9 | Slovakia | 1988 | 23 |
10 | Slovakia | 1989 | 45 |
11 | Slovakia | 1990 | 12 |
If applying df.population.diff()
directly, I get the following result:
country | year | population | change | |
---|---|---|---|---|
0 | Andorra | 1987 | 10 | NaN |
1 | Andorra | 1988 | 12 | 2.0 |
2 | Andorra | 1989 | 17 | 5.0 |
3 | Andorra | 1990 | 21 | 4.0 |
4 | Italy | 1987 | 34 | 13.0 |
5 | Italy | 1988 | 27 | -7.0 |
6 | Italy | 1989 | 22 | -5.0 |
7 | Italy | 1990 | 26 | 4.0 |
8 | Slovakia | 1987 | 18 | -8.0 |
9 | Slovakia | 1988 | 23 | 5.0 |
10 | Slovakia | 1989 | 45 | 22.0 |
11 | Slovakia | 1990 | 12 | -33.0 |
which is wrong, of course, because the change for Italy, 1987 was calculated by subtraction from Andorra, 1990. I must generate a df.change
column in which every entry for individual country, 1987 will be NaN
.
In this simple example, I was able to get such a dataframe by using .diff() “semi-manually” by writing the following code:
diff_andorra = df[ df.country== "Andorra" ].population.diff()
diff_italy = df[ df.country== "Italy" ].population.diff()
diff_slovakia = df[ df.country== "Slovakia" ].population.diff()
diff_series = pd.concat([diff_andorra, diff_italy, diff_slovakia])
df['change'] = diff_series
This resulted in the desired dataframe:
country | year | population | change | |
---|---|---|---|---|
0 | Andorra | 1987 | 10 | NaN |
1 | Andorra | 1988 | 12 | 2.0 |
2 | Andorra | 1989 | 17 | 5.0 |
3 | Andorra | 1990 | 21 | 4.0 |
4 | Italy | 1987 | 34 | NaN |
5 | Italy | 1988 | 27 | -7.0 |
6 | Italy | 1989 | 22 | -5.0 |
7 | Italy | 1990 | 26 | 4.0 |
8 | Slovakia | 1987 | 18 | NaN |
9 | Slovakia | 1988 | 23 | 5.0 |
10 | Slovakia | 1989 | 45 | 22.0 |
11 | Slovakia | 1990 | 12 | -33.0 |
But since the original dataframe has 51 countries altogether, I can’t apply the “semi-manual” technique on all of them. My idea was to create a for
loop of the following form:
diff_temp = pd.Series(dtype=float)
for individual_country in set(df.country):
temp = df[ df.country== individual_country ].population.diff()
diff_temp = pd.concat(temp)
The first line defines an empty series for float
datatype (in analogy how we often create empty list()
or dict()
objects). The set(df.country)
contains the name of all the country entries, but they appear only once, due to the nature of set()
type. For country in the set, the temp
variable holds the .diff() series which would then be .concat()
-ed or .append()
-ed to the diff_temp
series.
And it is on this last line that my code breaks down, showing the following error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "Series"
If I only print(temp)
after each loop, the output shows this:
0 NaN
1 2.0
2 5.0
3 4.0
Name: population, dtype: float64
4 NaN
5 -7.0
6 -5.0
7 4.0
Name: population, dtype: float64
8 NaN
9 5.0
10 22.0
11 -33.0
Name: population, dtype: float64
This indicates that “piecewise” .diff()
-ing is doable, but I have problem merging them into one series which would then be added as another column with df['change'] = diff_temp
, in order to achieve the above mentioned table, i.e. this (repeated for easier reference):
country | year | population | change | |
---|---|---|---|---|
0 | Andorra | 1987 | 10 | NaN |
1 | Andorra | 1988 | 12 | 2.0 |
2 | Andorra | 1989 | 17 | 5.0 |
3 | Andorra | 1990 | 21 | 4.0 |
4 | Italy | 1987 | 34 | NaN |
5 | Italy | 1988 | 27 | -7.0 |
6 | Italy | 1989 | 22 | -5.0 |
7 | Italy | 1990 | 26 | 4.0 |
8 | Slovakia | 1987 | 18 | NaN |
9 | Slovakia | 1988 | 23 | 5.0 |
10 | Slovakia | 1989 | 45 | 22.0 |
11 | Slovakia | 1990 | 12 | -33.0 |
I am uncertain what am I doing wrong. I tried using .append()
but it resulted in df.change
entries all being equal to NaN
.
Demenaio is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.