I’m encountering an issue while trying to count the number of distinct values in a Spark DataFrame column based on a condition using sparklyr. Here’s the code I’m using:
library(sparklyr)
library(dplyr)
# Create a sample data frame
df <- data.frame(
  appl = c("Apple", "Microsoft", "Google", "Amazon", "Facebook", "Samsung", "IBM"),
  appl_y = c("y", "n", "y", "n", "y", "n", "y"),
  manu = c("USA", "USA", "USA", "China", "USA", "South Korea", "USA"),
  alternate_flag = c("y", "n", "y", "y", "n", "y", "n")
)
# Connect to Spark
sc <- spark_connect(master = "local")
# Create the Spark DataFrame
df_spark <- copy_to(sc, df, "df_spark")
# Group by 'manu' and summarize
result <- df_spark %>%
  group_by(manu) %>%
  summarize(num_appl_y = n_distinct(appl[appl_y == 'y']),
            num_appl_flag = n_distinct(appl[alternate_flag == 'y']))
# Show the result
collect(result)
The intention is to group the data by the manu column and then, within each group, count the distinct values of appl for the rows where appl_y is 'y', and likewise for the rows where alternate_flag is 'y'. For example, the USA group should give num_appl_y = 4 (Apple, Google, Facebook, IBM) and num_appl_flag = 2 (Apple, Google). However, this doesn't work: the counts come out wrong when I run it this way in sparklyr, even though the same summarize on the local data frame with plain dplyr gives the expected result.
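In case it helps frame the question: my guess is that the subsetting expression appl[appl_y == 'y'] has no direct Spark SQL translation (show_query(result) should reveal what sparklyr actually sends to Spark, though I haven't included that output here). Below is a rewrite I've been considering, sketched under the assumption that ifelse() translates to a CASE WHEN and that COUNT(DISTINCT ...) ignores NULLs in Spark SQL; I haven't verified that this is the right approach.

# Hypothetical workaround (unverified assumption): map non-matching rows to NA
# so the resulting NULLs are ignored by COUNT(DISTINCT ...) after translation
result2 <- df_spark %>%
  group_by(manu) %>%
  summarize(num_appl_y = n_distinct(ifelse(appl_y == 'y', appl, NA)),
            num_appl_flag = n_distinct(ifelse(alternate_flag == 'y', appl, NA)))
collect(result2)

Is this the right way to express a conditional distinct count in sparklyr, or is there a more idiomatic approach?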