I’m trying to build an initial regression model, but I don’t want a coefficient for Item5 itself to show up, only the coefficients for the other predictors and for the dummy columns Item5_1, Item5_2, etc., since Item5 is already encoded. I also have to follow the K-1 rule: for each categorical variable I pick, create one fewer dummy variable than its number of categories. How can I adjust the code so the model reports Item5_1, Item5_2, etc. instead of Item5?
categorical_variables <- c(
"City", "State", "County", "Zip", "Area", "Timezone", "Job", "Marital",
"Gender", "ReAdmis", "Soft_drink", "Initial_admin", "HighBlood", "Stroke",
"Complication_risk", "Overweight", "Arthritis", "Diabetes", "Hyperlipidemia",
"BackPain", "Anxiety", "Allergic_rhinitis", "Reflux_esophagitis", "Asthma",
"Services", "Item1", "Item2", "Item3", "Item4", "Item5", "Item6", "Item7", "Item8"
)
continuous_variables <- c(
"Population", "Children", "Age", "Income", "VitD_levels", "Doc_visits",
"Full_meals_eaten", "VitD_supp", "Initial_days", "TotalCharge", "Additional_charges"
)
I then ran an ANOVA test and kept only the variables that were statistically significant (p-value < 0.05):
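The screening code itself is not shown in this post. As a rough sketch only, a one-predictor-at-a-time ANOVA filter might look like the following on made-up toy data. The data frame `toy`, the response `y`, and the helper `anova_p_value` are hypothetical names chosen for illustration, not taken from the script in this post.

```r
# Toy data, made up for illustration: `g` clearly shifts `y`, `h` barely does.
toy <- data.frame(
  y = c(rep(0, 10), rep(10, 10)) + rep(c(-0.1, 0.1), times = 10),
  g = rep(c("low", "high"), each = 10),
  h = rep(c("x", "y"), times = 10)
)

# Hypothetical helper: p-value of a one-way ANOVA of `response` on one predictor.
anova_p_value <- function(predictor, response, data) {
  fit <- lm(reformulate(predictor, response = response), data = data)
  anova(fit)[["Pr(>F)"]][1]
}

candidates <- c("g", "h")
p_values <- sapply(candidates, anova_p_value, response = "y", data = toy)

# Keep only the predictors below the 0.05 threshold.
significant <- names(p_values)[p_values < 0.05]
significant
```

Screening each predictor separately ignores how predictors interact, so a filter like this is best treated as a first pass rather than a final selection.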
# Continuous variables
continuous_variables <- c("Children", "VitD_levels", "VitD_supp", "Income") # Included "Income"
# Categorical variables
categorical_variables <- c("Item5" , "HighBlood")
# Combine the lists of variables
variables_to_keep <- c(continuous_variables, categorical_variables)
# Subset the dataframe to only include these variables
medical_data <- medical_data[, variables_to_keep]
# Use these variables as dummy variables because each has 2 or more categories
# Create dummy variables for the categorical variables, following the K-1 rule
# 'HighBlood' has 2 categories, so it needs only 1 dummy variable; HighBloodYes would be in the model
medical_data_with_dummies <- dummy_cols(medical_data, select_columns = c("Item5", "Reflux_esophagitis"))
# Replacing a variable that has more than 3 unique categories with one that has a relatively low p-value
# Extract the names of the dummy variables created, excluding the original variables
categorical_dummy_variables <- setdiff(names(medical_data_with_dummies), names(medical_data))
# Now that the categorical data has been converted to dummy variables, I can output the cleaned data for submission
# File path for the CSV file
file_path <- "/Users/name/Desktop/medical_clean_forReview.csv"
# Filter out dummy variables that have only 1 factor level
# Remove columns with only one unique value (dummies with 1 level) to keep the analysis simple
medical_data_with_dummies <- medical_data_with_dummies[, sapply(medical_data_with_dummies, function(x) length(unique(x)) != 1)]
# Assuming 'medical_data_with_dummies' is your dataframe and it's already in your R environment
write.csv(medical_data_with_dummies, file = file_path, row.names = FALSE)
# Create an initial multiple linear regression model
model <- lm(Income ~ ., data = medical_data_with_dummies)
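One possible reason a single Item5 coefficient appears is that Item5 is stored as a number, so `lm()` fits one slope for it. The sketch below uses made-up toy data (the data frame `toy` and its values are assumptions, not the real `medical_data`) to show how converting the column to a factor makes `lm()` build K-1 indicator terms on its own.

```r
# Toy stand-in for medical_data; all values are made up for illustration.
toy <- data.frame(
  Income    = c(30, 42, 55, 38, 61, 47, 52, 35, 44, 58, 40, 50),
  Children  = c(0, 2, 1, 3, 0, 1, 2, 4, 1, 0, 2, 3),
  Item5     = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
  HighBlood = c("Yes", "No", "No", "Yes", "Yes", "No",
                "No", "Yes", "No", "Yes", "No", "Yes")
)

# Stored as a number, Item5 gets a single slope coefficient named "Item5".
numeric_fit <- lm(Income ~ ., data = toy)

# Stored as a factor, lm() expands it into K-1 indicator terms
# (the first level, 1, becomes the reference category).
toy$Item5 <- factor(toy$Item5)
factor_fit <- lm(Income ~ ., data = toy)

names(coef(numeric_fit))
names(coef(factor_fit))
```

With the factor version the terms are named `Item52` and `Item53` rather than `Item5_2` and `Item5_3`. If the underscore names from fastDummies are needed, `dummy_cols()` has `remove_selected_columns` and `remove_first_dummy` arguments that are meant to drop the original column and one dummy per variable; checking the package documentation for the installed version is advisable before relying on them.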
I tried to remove Item5 from
categorical_variables <- c("Item5" , "HighBlood")
but when I ran
medical_data_with_dummies <- dummy_cols(medical_data, select_columns = c("Item5", "Reflux_esophagitis"))
It says:
Error in dummy_cols(medical_data, select_columns = c("Item5", "Reflux_esophagitis")) :
select_columns is/are not in data. Please check data and spelling.
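The message suggests that none of the names passed to `select_columns` exist in `medical_data` at that point. After the subsetting step, `medical_data` only holds the columns in `variables_to_keep`, and Reflux_esophagitis was never in that list, so once Item5 is removed as well there is nothing left to encode. A quick base-R check along these lines, shown on a made-up toy data frame (`toy` and `requested` are hypothetical names), can confirm which requested columns are missing before calling `dummy_cols()`:

```r
# Toy stand-in for the subsetted data frame; values are made up.
toy <- data.frame(
  Children  = c(0, 2, 1),
  Income    = c(30, 42, 55),
  HighBlood = c("Yes", "No", "No")
)

requested <- c("Item5", "Reflux_esophagitis")

# Requested columns that the data frame does not contain.
missing_cols <- setdiff(requested, names(toy))
# Requested columns that are actually available to encode.
present_cols <- intersect(requested, names(toy))

missing_cols
present_cols
```

If `present_cols` comes back empty, the columns need to be kept in `variables_to_keep` (or the dummy step moved before the subsetting) so they still exist when they are encoded.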