I am working with a dataframe on county executives. I want to run a panel study where the unit of analysis is the county-year.
The problem is that sometimes two or more county executives serve during the same year. I want to remove these semi-duplicate rows. I ALWAYS want to keep the county executive that is listed first.
If my initial df is:
df <- data.frame(year= c(2000, 2001, 2001, 2002, 2000, 2001, 2002, 2002, 2002),
executive.name= c("Johnson", "Smith", "Peters", "Alleghany", "Roberts", "Clarke", "Tollson", "Brown", "Taschen"),
district= c(1001, 1001, 1001, 1001, 1002, 1002, 1002, 1002, 1002))
I want to make it look like this
df.neat <- data.frame(year= c(2000, 2001, 2002, 2000, 2001, 2002),
executive.name= c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
district= rep(c(1001, 1002), each=3))
1
You can do group by district and year, then take the first row from each group.
library(dplyr)
df_neat <- df %>%
group_by(district, year) %>%
slice_head(n = 1) %>%
ungroup()
Chen is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
1
You can achieve this by using the drop_duplicates
function in pandas, which will help you remove duplicate rows based on the year
and district
columns, while keeping the first occurrence. Here’s how you can do it in Python:
import pandas as pd
# Create the initial DataFrame
df = pd.DataFrame({
'year': [2000, 2001, 2001, 2002, 2000, 2001, 2002, 2002, 2002],
'executive.name': ['Johnson', 'Smith', 'Peters', 'Alleghany', 'Roberts', 'Clarke', 'Tollson', 'Brown', 'Taschen'],
'district': [1001, 1001, 1001, 1001, 1002, 1002, 1002, 1002, 1002]
})
# Drop duplicates while keeping the first occurrence
df_neat = df.drop_duplicates(subset=['year', 'district'], keep='first').reset_index(drop=True)
# Display the resulting DataFrame
df_neat
The resulting df_neat
will look like this:
year executive.name district
0 2000 Johnson 1001
1 2001 Smith 1001
2 2002 Alleghany 1001
3 2000 Roberts 1002
4 2001 Clarke 1002
5 2002 Tollson 1002
This code ensures that for each combination of year
and district
, only the first occurrence of the county executive is kept, effectively removing any semi-duplicate rows.
Base R
first = (x) x[1L]
aggregate(.~year+district, df, first)
#> year district executive.name
#> 1 2000 1001 Johnson
#> 2 2001 1001 Smith
#> 3 2002 1001 Alleghany
#> 4 2000 1002 Roberts
#> 5 2001 1002 Clarke
#> 6 2002 1002 Tollson