I’m new to ML and am trying my hand at multiple linear regression (MLR) to predict a student’s final grade using this Kaggle dataset.
I know that you’re supposed to:
a) encode all binary and categorical data
b) discard variables that display multicollinearity
c) pick variables that have a linear relationship with the dependent variable (I’m a bit confused about this one, since in a lot of the videos I’ve watched so far, people don’t really check for it before training a linear regression model)
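To make step (c) concrete, here is a minimal sketch of one way such a check could look. It uses synthetic data, not the Kaggle dataset, and the names `x_linear`, `x_curved`, `pearson_r`, and `r_squared` are made up for illustration. It compares the Pearson correlation of each predictor with the target, and how much R² improves when a squared term is allowed; a large gain hints that a straight line is not a good fit for that predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical predictors: one enters the target linearly, one quadratically.
x_linear = rng.uniform(-3, 3, n)
x_curved = rng.uniform(-3, 3, n)
y = 2 * x_linear + x_curved**2 + rng.normal(0, 1, n)


def pearson_r(x, y):
    """Pearson correlation between a single predictor and the target."""
    return float(np.corrcoef(x, y)[0, 1])


def r_squared(x, y, degree):
    """R^2 of a polynomial fit of the given degree (with intercept)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(1 - residuals.var() / y.var())


r_linear = pearson_r(x_linear, y)
r_curved = pearson_r(x_curved, y)

# How much does adding a squared term help each predictor?
gain_linear = r_squared(x_linear, y, 2) - r_squared(x_linear, y, 1)
gain_curved = r_squared(x_curved, y, 2) - r_squared(x_curved, y, 1)

print(f"x_linear: r = {r_linear:.2f}, R^2 gain from squared term = {gain_linear:.3f}")
print(f"x_curved: r = {r_curved:.2f}, R^2 gain from squared term = {gain_curved:.3f}")
```

A near-zero correlation on its own does not mean a predictor is useless: in this sketch `x_curved` is strongly related to `y`, just not linearly. Plotting each predictor against the target, or the residuals against the fitted values, is the usual visual version of the same check.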
I have done (a) by label-encoding my binary variables and one-hot encoding my categorical variables. I also calculated the VIF for each variable (the one-hot encoded ones all have an infinite VIF). And now… I’m not sure how to proceed. All I can say with some confidence is that since G2 has the highest finite VIF score, I can discard it; and that since Medu has a close but higher score than Fedu, I can discard that too (same with Walc versus Dalc).
Here are the values I’m getting:
const 0.000000
school 1.518331
sex 1.489316
age 1.818399
address 1.388570
famsize 1.153361
Pstatus 1.145962
Medu 2.946452
Fedu 2.147572
traveltime 1.322387
studytime 1.398220
failures 1.567588
schoolsup 1.262329
famsup 1.306325
paid 1.339139
activities 1.167950
nursery 1.153852
higher 1.316551
internet 1.258651
romantic 1.179480
famrel 1.173444
freetime 1.322079
goout 1.496537
Dalc 2.036903
Walc 2.405555
health 1.181635
absences 1.297898
G1 4.794857
G2 8.414788
G3 6.483623
Mjob__at_home inf
Mjob__health inf
Mjob__other inf
Mjob__services inf
Mjob__teacher inf
Fjob__at_home inf
Fjob__health inf
Fjob__other inf
Fjob__services inf
Fjob__teacher inf
reason__course inf
reason__home inf
reason__other inf
reason__reputation inf
guardian__father inf
guardian__mother inf
guardian__other inf
dtype: float64
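For reference, the pattern in this output (finite VIFs for the numeric columns, `inf` for every one-hot column) can be reproduced on a small made-up dataset. The sketch below is not the code that produced the numbers above: the data are invented, only three column names are borrowed from the output, and `vif_table` is a hypothetical helper that computes VIF_j = 1 / (1 - R_j²) with plain NumPy. It assumes the `inf` values arise because all dummy levels of a category are kept alongside an intercept, so each dummy is an exact linear combination of the others, and it shows the effect of dropping one level per category with `drop_first=True`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Invented values; only the column names are borrowed from the output above.
df = pd.DataFrame({
    "age": rng.integers(15, 23, n),
    "studytime": rng.integers(1, 5, n),
    "guardian": rng.choice(["mother", "father", "other"], n),
})


def vif_table(X):
    """VIF per column: regress each column on all the others plus an intercept."""
    X = X.astype(float)
    out = {}
    for col in X.columns:
        target = X[col].to_numpy()
        others = X.drop(columns=col).to_numpy()
        A = np.column_stack([np.ones(len(X)), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1 - ss_res / ss_tot
        # Treat a numerically perfect fit as infinite VIF.
        out[col] = np.inf if r2 >= 1 - 1e-12 else 1 / (1 - r2)
    return pd.Series(out)


# All three guardian levels kept: the dummies sum to 1, duplicating the intercept.
full = pd.get_dummies(df, columns=["guardian"], prefix_sep="__")
vif_full = vif_table(full)

# One level dropped per category: the exact linear dependence disappears.
reduced = pd.get_dummies(df, columns=["guardian"], prefix_sep="__", drop_first=True)
vif_reduced = vif_table(reduced)

print(vif_full)
print(vif_reduced)
```

In this sketch the dependent variable is left out of the VIF computation entirely, since VIF is meant to describe relationships among the predictors only.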