To me this seems like it should be a straight-out-of-the-box feature when working with Spark data tables, but I simply cannot find out how to train one model per Spark partition (or per group) using sparklyr in R.
I am aware of the spark_apply() function, but it seems to be tailored for user-defined functions that are not already available in Spark.
I want something like the example below, where a linear model is fitted on each group (it could also be per partition). The problem with the code below is that it produces a single global fit (across both groups) rather than two separate fits (one per group).
library(sparklyr)
library(dplyr)

# Make example data: two groups, each with its own intercept
set.seed(134)
n <- 10
x <- 1:n
y <- -x + rep(c(10, 20), each = 5) + rnorm(n)
df <- data.frame(
  group = rep(c("A", "B"), each = n / 2),
  y = y,
  x = x
)

# Copy the data to Spark
sc <- spark_connect(master = "local")
sdf <- copy_to(sc, df, name = "sdf")

# Fit linear model -- this returns one global fit, not one fit per group
sdf %>%
  group_by(group) %>%
  ml_linear_regression(y ~ x)
How can I fit one model for each group or partition?
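For reference, the spark_apply() route I mentioned above would look roughly like this (a minimal sketch, assuming spark_apply()'s group_by argument splits the rows by the group column before the plain-R function runs on each worker; the intercept/slope column names are just ones I made up):

# Sketch: per-group fits via spark_apply(), dropping down to plain R lm()
group_fits <- sdf %>%
  spark_apply(
    function(df) {
      # Ordinary lm() fitted on one group's rows only
      fit <- lm(y ~ x, data = df)
      data.frame(
        intercept = coef(fit)[[1]],
        slope     = coef(fit)[[2]]
      )
    },
    group_by = "group"
  )
group_fits

Even if something along these lines works, it runs plain lm() on each worker rather than Spark's own machinery; I was hoping for a way to do this with the ml_* functions directly.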