==============================================================================
I am relatively new to Python and still learning how to use its libraries for data manipulation.
I am looking for guidance on a specific problem: handling missing data in a structured dataset.
=============================================================================
I am working with a pandas DataFrame that contains socio-economic indicators for various countries over several years (1960-2023). Each row corresponds to a country and a specific year, with columns for each indicator. Many of these columns have missing values which I need to fill based on the most recent non-missing value for each country. The fill should only continue until a new non-missing value is encountered, and it should restart with each new country.
Here’s a simplified example of what the DataFrame might look like:
Country Name | Year | Indicator1 | Indicator2 | Indicator3
Argentina | 2000 | 20 | NaN | NaN
Argentina | 2001 | NaN | NaN | 20
Argentina | 2002 | 40 | 30 | NaN
Brazil | 2000 | 15 | NaN | NaN
Brazil | 2001 | NaN | 20 | 10
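I need each missing value filled from the most recent previous entry in the same column, with each country’s data treated independently. Where a country’s earliest years are missing, the example below takes the next available value for that country instead. Here’s the desired output: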
Country Name | Year | Indicator1 | Indicator2 | Indicator3
Argentina | 2000 | 20 | 30 | 20
Argentina | 2001 | 20 | 30 | 20
Argentina | 2002 | 40 | 30 | 20
Brazil | 2000 | 15 | 20 | 10
Brazil | 2001 | 15 | 20 | 10
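To make the example reproducible, here is a minimal sketch that builds the sample table as a DataFrame and applies a forward fill within each country. The variable names `df`, `indicator_cols`, and `forward_only` are mine, chosen only for illustration. It suggests that a forward fill alone leaves gaps before a country's first observation (for example Argentina 2000, Indicator2), whereas the desired output has those rows filled as well.

```python
import numpy as np
import pandas as pd

# Sample input from the table above (NaN marks a missing value).
df = pd.DataFrame({
    "Country Name": ["Argentina", "Argentina", "Argentina", "Brazil", "Brazil"],
    "Year": [2000, 2001, 2002, 2000, 2001],
    "Indicator1": [20, np.nan, 40, 15, np.nan],
    "Indicator2": [np.nan, np.nan, 30, np.nan, 20],
    "Indicator3": [np.nan, 20, np.nan, np.nan, 10],
})

indicator_cols = ["Indicator1", "Indicator2", "Indicator3"]

# Forward fill within each country: the last known value is carried down,
# but rows before a country's first observation stay NaN, and nothing
# leaks from one country into the next.
forward_only = df.groupby("Country Name")[indicator_cols].ffill()

print(forward_only)
```

Rows that are still NaN after this step are the ones that would need a value from a later year of the same country.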
I tried to implement this with pandas. The DataFrame is loaded from an Excel file, sorted by ‘Country Name’ and ‘Year’, and then forward-filled within each country group:
import pandas as pd

def conditional_fill_down(file_path):
    data = pd.read_excel(file_path)
    data_sorted = data.sort_values(by=['Country Name', 'Year'])
    data_filled = data_sorted.groupby('Country Name').apply(lambda group: group.ffill()).reset_index(drop=True)
    return data_filled

file_path = 'poverty.xlsx'
filled_data = conditional_fill_down(file_path)
output_path = 'poverty_edited.xlsx'
filled_data.to_excel(output_path, index=False)
However, this approach didn’t work.
Could someone help me implement this in pandas in a way that the filling respects the boundaries of each country group and stops filling when a new non-missing value appears?
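One possible direction, offered as a sketch rather than a definitive answer: use the group-wise `ffill` and `bfill` methods instead of `groupby(...).apply(...)`. The function name `fill_within_country` and its parameters are hypothetical. The backward-fill step is an assumption based on the desired output, where a country's leading gaps take the next available value; if only forward filling is wanted, that line can be dropped.

```python
import numpy as np
import pandas as pd


def fill_within_country(data, group_col="Country Name", year_col="Year"):
    """Fill gaps per country: forward fill, then backward fill leading gaps.

    The backward fill is an assumption inferred from the desired output.
    """
    out = data.sort_values([group_col, year_col]).reset_index(drop=True)
    value_cols = [c for c in out.columns if c not in (group_col, year_col)]

    # Carry the last known value forward, never across country boundaries.
    filled = out.groupby(group_col)[value_cols].ffill()
    # Fill what is still missing at the start of each country from the
    # next available value of that same country.
    filled = filled.groupby(out[group_col]).bfill()

    out[value_cols] = filled
    return out


# Small demonstration using one indicator from the sample data.
sample = pd.DataFrame({
    "Country Name": ["Argentina", "Argentina", "Argentina", "Brazil", "Brazil"],
    "Year": [2000, 2001, 2002, 2000, 2001],
    "Indicator2": [np.nan, np.nan, 30, np.nan, 20],
})
result = fill_within_country(sample)
print(result)
```

With the Excel file from the question, this would presumably be called as `fill_within_country(pd.read_excel('poverty.xlsx'))` before writing the result back with `to_excel`. A column that is entirely missing for a country stays NaN, since there is no value within that country to fill from.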