Here is my interpretation of my model so far. I am investigating the relationship between rating and followers for video games, but there is a problem: higher ratings do go with more followers, but only for very few games.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, learning_curve
polynomial_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = polynomial_features.fit_transform(X)
model_reg = LinearRegression()
cv_result = cross_validate(model_reg, X_poly, y, cv = 5)
cv_result['test_score'].mean()
# 0.031169232070832886 (mean R2 across the 5 folds)
The R2 is very low, which is no surprise once I plot the predictions.
model = LinearRegression()
sorted_df = games_top.sort_values('rating')
sorted_X = sorted_df[['rating']]
sorted_y = sorted_df['followers']
# Create polynomial features
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly_sorted = poly_features.fit_transform(sorted_X)
model.fit(X_poly_sorted, sorted_y)
predictions = model.predict(X_poly_sorted)
%matplotlib widget
import matplotlib.pyplot as plt
import seaborn as sns
# Plot predictions over the original data
sns.scatterplot(x=sorted_X['rating'], y=sorted_y, alpha=0.5)
plt.plot(sorted_X['rating'], predictions, linewidth=3, color='r')
If I plot my first, cross-validated model (model_reg), it looks like it is overfitting. Am I at a dead end with this feature? Should I get rid of the many low outliers?
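One check that might help here, as a minimal sketch: compare the train and validation R2 of the same degree-2 model using `cross_validate` with `return_train_score=True`. The data below is synthetic and made up purely so the snippet runs on its own (it is not my real data or results); with the real data, `X` and `y` would be `games_top[['rating']]` and `games_top['followers']`. Wrapping the polynomial step in a pipeline is my own choice for the sketch, not something the code above does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for games_top: invented numbers, for illustration only.
rng = np.random.default_rng(0)
rating = rng.uniform(0, 100, size=500)
# Heavily skewed target: most values small, a few very large.
followers = rng.lognormal(mean=2.0 + 0.02 * rating, sigma=1.5)

X = rating.reshape(-1, 1)
y = followers

# Same degree-2 polynomial regression as model_reg, as a single pipeline.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

# return_train_score=True also reports R2 on the training folds.
scores = cross_validate(pipe, X, y, cv=5, return_train_score=True)
train_r2 = scores["train_score"].mean()
test_r2 = scores["test_score"].mean()
print(f"train R2: {train_r2:.3f}, validation R2: {test_r2:.3f}")
```

If both scores are low and close to each other, the model is underfitting rather than overfitting; a high train score with a much lower validation score would point to overfitting.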