Question:
I have a DataFrame containing loan information over a period of time. I need to create a function get_specific_insights that obtains specific insights from the DataFrame. The function should group the DataFrame by a specified list of columns and retain only the top K values for each group based on a ‘SanctionLimit’ column. The grouping can be hierarchical, meaning I may need to filter by top values at multiple levels of grouping.
For instance, I need to find the top 5 LoanState according to SanctionLimit. Within each of these top states, I need to find the top 3 BankNames, and within each top bank of the top states, I need to find the top 5 AgentNames.
Problem:
The function should be generalized to handle any number of columns in groupby_columns and their corresponding top K values in top_k. The input parameters for the function will be:
df: The DataFrame containing the loan information.
groupby_columns: A list of columns by which the DataFrame should be grouped, in the specified order.
top_k: A list containing the number of top values to retain for each column in groupby_columns, in the specified order.
Here is what I attempted to do:
For the first filter, I found out the top k values. I filtered my dataframe to contain only these value for the column.
For the next filter, I found out the top K values for each of the previous top values and so on.
This will lead to the formation of tuples & I will try to match these tuples to obtain the respective ways from the dataframe. However, I have the feeling that a better(scalable) method is out there that I am not able to think about.
raosnhatlodn is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.