Let’s say I have the following dataframe df1, corresponding to user1:
+-------------------+-------+--------+-------+-------+----------+----------------+
| Models | MAE | MSE | RMSE | MAPE | R² score | Runtime [ms] |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LinearRegression | 4.906 | 27.784 | 5.271 | 0.405 | -6.917 | 0:00:43.387145 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| Random Forest | 2.739 | 10.239 | 3.2 | 0.231 | -1.917 | 0:28:11.761681 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| XGBoost | 2.826 | 10.898 | 3.301 | 0.234 | -2.105 | 0:03:58.883474 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| MLPRegressor | 5.234 | 30.924 | 5.561 | 0.43 | -7.812 | 0:01:44.252276 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| SVR | 5.061 | 29.301 | 5.413 | 0.417 | -7.349 | 0:04:52.754769 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| CatBoostRegressor | 2.454 | 8.823 | 2.97 | 0.201 | -1.514 | 0:19:36.925169 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LGBMRegressor | 2.76 | 10.204 | 3.194 | 0.231 | -1.907 | 0:04:51.223103 |
+-------------------+-------+--------+-------+-------+----------+----------------+
I have the following dataframe df2, corresponding to user2:
+-------------------+-------+--------+-------+-------+----------+----------------+
| Models | MAE | MSE | RMSE | MAPE | R² score | Runtime [ms] |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LinearRegression | 4.575 | 24.809 | 4.981 | 0.377 | -6.079 | 0:00:45.055854 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| Random Forest | 2.345 | 8.065 | 2.84 | 0.199 | -1.301 | 0:10:55.468473 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| XGBoost | 2.129 | 7.217 | 2.686 | 0.179 | -1.059 | 0:01:01.575033 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| MLPRegressor | 4.414 | 23.477 | 4.845 | 0.363 | -5.699 | 0:00:31.231719 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| SVR | 4.353 | 22.826 | 4.778 | 0.357 | -5.513 | 0:02:12.258870 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| CatBoostRegressor | 2.281 | 7.671 | 2.77 | 0.189 | -1.189 | 0:08:16.526615 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LGBMRegressor | 2.511 | 9.18 | 3.03 | 0.212 | -1.619 | 0:15:25.084937 |
+-------------------+-------+--------+-------+-------+----------+----------------+
Let’s say I have more dataframes, up to df1000, corresponding to user1000.
Problem statement:
I want to rank the Models (sorted) by a specific column (e.g. MAE) and return the frequency of each sorted list of top models over all dataframes (df1 to df1000). This is not something I can easily achieve using:
df["category"].value_counts()
So I definitely need to transform the data and add a list of the sorted model names, which would be a list of strings. Including the Users’ names in the final transformed dataframe could also be useful, although I did not include them in the expected output table below.
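A minimal sketch of the transformation step described above, assuming each user's dataframe has a Models column and that a lower MAE is better. The helper rank_models and the frames mapping are hypothetical names introduced only for illustration. The MAE values are copied from df1 and df2, with the other columns left out for brevity.

```python
import pandas as pd


def rank_models(df: pd.DataFrame, metric: str = "MAE", ascending: bool = True) -> tuple:
    """Return the model names ordered by `metric` (hypothetical helper).

    ascending=True puts the smallest value first, which suits error
    metrics such as MAE; pass ascending=False for scores where higher is better.
    """
    ordered = df.sort_values(metric, ascending=ascending, kind="stable")
    return tuple(ordered["Models"])


models = ["LinearRegression", "Random Forest", "XGBoost", "MLPRegressor",
          "SVR", "CatBoostRegressor", "LGBMRegressor"]

# MAE values taken from the two tables above.
df1 = pd.DataFrame({"Models": models,
                    "MAE": [4.906, 2.739, 2.826, 5.234, 5.061, 2.454, 2.760]})
df2 = pd.DataFrame({"Models": models,
                    "MAE": [4.575, 2.345, 2.129, 4.414, 4.353, 2.281, 2.511]})

# Hypothetical mapping from user name to that user's dataframe.
frames = {"user1": df1, "user2": df2}

# One row per user: the user name and the sorted tuple of model names.
rankings = pd.DataFrame(
    [{"User": user, "MAE": rank_models(df, "MAE")} for user, df in frames.items()]
)
print(rankings)
```

Tuples are used instead of lists here because they are hashable, which makes the later counting step straightforward.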
Expected output:
+--------------------+------------------------------------------------------+--------+---------+
| Rank               | MAE                                                  | counts | freq(%) |
+--------------------+------------------------------------------------------+--------+---------+
| Top models(sorted) | ["CatBoostRegressor","RandomForest","LGBMRegressor", | 70     | 65%     |
|                    |  "XGBoost","LinearRegression","SVR","MLPRegressor"]  |        |         |
| Top models(sorted) | ["LGBMRegressor","CatBoostRegressor","RandomForest", | 20     | 12%     |
|                    |  "XGBoost","LinearRegression","SVR","MLPRegressor"]  |        |         |
| ....               |                                                      |        |         |
+--------------------+------------------------------------------------------+--------+---------+
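One possible way to get a table of this shape, sketched under the assumption that each user's ranking is already available as a tuple of model names. The rankings dictionary below is made-up illustration data (four hypothetical users, three models each), not results from the dataframes above. It counts identical rankings with collections.Counter from the standard library.

```python
from collections import Counter

import pandas as pd

# Hypothetical per-user rankings, best model first (illustration data only).
rankings = {
    "user1": ("CatBoostRegressor", "Random Forest", "LGBMRegressor"),
    "user2": ("XGBoost", "CatBoostRegressor", "Random Forest"),
    "user3": ("CatBoostRegressor", "Random Forest", "LGBMRegressor"),
    "user4": ("CatBoostRegressor", "Random Forest", "LGBMRegressor"),
}

# Tuples are hashable, so identical rankings can be counted directly.
counts = Counter(rankings.values())
total = sum(counts.values())

# most_common() yields (ranking, count) pairs, most frequent first.
summary = pd.DataFrame(
    [
        {"MAE": list(ranking), "counts": n, "freq(%)": round(100 * n / total, 1)}
        for ranking, n in counts.most_common()
    ]
)
print(summary)
```

Counting whole tuples treats two rankings as equal only when every position matches. If only the best model matters, counting ranking[0] instead would be a simpler variant.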
I was also thinking that maybe I could use a Natural Language Processing (NLP) method called TF-IDF to handle this problem, using:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer
Potentially related posts I have checked:
- How can I compute a histogram (frequency table) for a single Series?
- Count the frequency that a value occurs in a dataframe column
- Efficient way to get frequency of elements in a pandas column of lists
- Calculate Frequency of item in list
- Get the frequency of individual items in a list of each row of a column in a dataframe
- count the frequency of elements in list of lists in Python
- What’s the best alternative to using lists as elements in a pandas dataframe?
- pandas – create dataframe with counts and frequency of elements
- Python: Calculate PMF for List in Pandas Dataframe
- Frequency plot of a Pandas Dataframe
- python & pandas – How to calculate frequency under conditions in columns in DataFrame?