I split a dataset into train and test sets by taking the row index at the 80% mark and splitting there, like this:
# Drop variables that are not useful
df <- df[,!(colnames(df) %in% c("sqm_lot", "sql_lot15"))]
df_train <- df[1:as.integer(nrow(df)*0.8),]
df_test <- df[as.integer(nrow(df)*0.8):nrow(df),]
nrow(df_test)
## [1] 4324
nrow(df_train)
## [1] 17290
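A side note on the split above: 17290 + 4324 is 21614, one more than the 21613 rows in the data, because the row at index as.integer(nrow(df)*0.8) ends up in both subsets. A minimal sketch of a non-overlapping split follows; the toy data frame and the names toy, n_train, toy_train and toy_test are hypothetical and only stand in for the real data.

```r
# Toy data frame standing in for the real data (hypothetical values)
toy <- data.frame(id = 1:10, y = seq(10, 100, by = 10))

# Index of the last training row (80% of the rows, rounded down)
n_train <- as.integer(nrow(toy) * 0.8)

# Training rows are 1..n_train; test rows start at n_train + 1,
# so no row appears in both subsets
toy_train <- toy[seq_len(n_train), ]
toy_test  <- toy[(n_train + 1):nrow(toy), ]

# The two parts should add up to the full data set
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```

Applied to the code above, this would amount to starting df_test at as.integer(nrow(df)*0.8) + 1 instead of as.integer(nrow(df)*0.8).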
For reference, the dataset has the following structure:
## 'data.frame': 21613 obs. of 16 variables:
## $ price_eur : num 206367 500340 167400 561720 474300 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : int 1 2 1 3 2 4 2 1 1 2 ...
## $ sqm_living : num 109.6 238.8 71.5 182.1 156.1 ...
## $ sqm_lot : num 525 673 929 464 751 ...
## $ floors : int 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ sqm_basement: num 0 37.2 0 84.5 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated: int 0 1991 0 0 0 0 0 0 0 0 ...
## $ sqm_living15: num 124 157 253 126 167 ...
## $ sqm_lot15 : num 525 710 749 464 697 ...
Then I trained a linear model like this.
x_train <- df_train[,!(colnames(df) %in% "price_eur")]
y_train <- df_train$price_eur
# Train the model
modelo.mlineal <- lm(formula = price_eur ~ ., data = df_train)
modelo.mlineal
##
## Call:
## lm(formula = price_eur ~ ., data = df_train)
##
## Coefficients:
## (Intercept) bedrooms bathrooms sqm_living sqm_lot
## 5.696e+06 -4.366e+04 5.886e+04 2.371e+03 3.063e-01
## floors waterfront view condition sqm_basement
## 2.524e+04 5.327e+05 4.727e+04 2.125e+04 -3.477e+02
## yr_built yr_renovated sqm_living15 sqm_lot15
## -3.000e+03 1.571e+01 1.039e+03 -8.048e+00
Then I used my df_test dataset to predict and compare with my y_test values.
# Take the y_test and x_test values
y_test <- df_test$price_eur
x_test <- df_test[,!(colnames(df_test) %in% "price_eur")]
# Subtract sqm_living from the model and pipe it into predict.lm
prediccion_prueba <- modelo.mlineal %>% predict.lm( data = df_test )
print(length(prediccion_prueba))
## [1] 17290
print(length(y_test))
## [1] 4324
As you can see, y_test and prediccion_prueba have different lengths, even though I am passing df_test to predict.lm. Instead, the prediction has the length of the df_train data that I used to fit the model. Why is this happening?
I was expecting y_test and prediccion_prueba to have the same length.
My desired output would look like this.
print(length(prediccion_prueba))
## [1] 4324
print(length(y_test))
## [1] 4324
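One likely explanation, offered as a sketch and not a definitive diagnosis: the argument of predict.lm that takes new observations is named newdata, not data. An argument called data is not matched to any named parameter, so it is absorbed by ... and ignored, and predict falls back to the fitted values for the training rows. The toy data and the names toy, fit, pred_wrong and pred_right below are hypothetical and only illustrate the difference.

```r
# Toy data (hypothetical): 20 rows, 16 for training and 4 for testing
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 3 * toy$x + rnorm(20)
toy_train <- toy[1:16, ]
toy_test  <- toy[17:20, ]

fit <- lm(y ~ x, data = toy_train)

# 'data' is not an argument of predict.lm, so it is silently ignored
# and the fitted values for the 16 training rows come back
pred_wrong <- predict(fit, data = toy_test)

# 'newdata' is the argument that supplies the rows to predict on
pred_right <- predict(fit, newdata = toy_test)

length(pred_wrong)  # 16, same as nrow(toy_train)
length(pred_right)  # 4, same as nrow(toy_test)
```

If that is the cause here, calling predict(modelo.mlineal, newdata = df_test) should return one prediction per row of df_test.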