I have a dataset of pairwise comparison stimuli that I want to allocate randomly and evenly to subsets, subject to some rules.
A sample of the comparison data frame (`dput` output) is below:
structure(list(comparison = c(1, 2, 3, 4, 5, 6), `speaker 1` = c("a",
"a", "a", "a", "a", "a"), `condition 1` = c(1, 1, 1, 1, 1, 1),
`voice 1` = c("a1", "a1", "a1", "a1", "a1", "a1"), `speaker 2` = c("b",
"c", "d", "e", "b", "c"), `condition 2` = c(1, 1, 1, 1, 2,
2), `voice 2` = c("b1", "c1", "d1", "e1", "b2", "c2")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
There are 360 comparisons in total, which I want to split into 6 equal subsets of exactly 60 comparisons each.
I therefore want to make sure that, firstly, every unique comparison is allocated to a subset; secondly, every subset contains exactly 60 unique comparisons; and thirdly, every subset contains exactly 4 comparisons for each voice, counting appearances in either the ‘voice 1’ column or the ‘voice 2’ column.
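To make the three rules concrete, here is a small check that I would expect any valid allocation to pass. It is only a sketch: `check_allocation`, its arguments and the toy data are names I made up for illustration, and it assumes the allocation is stored as a vector of subset numbers, one per row. If I have the arithmetic right, the third rule also implies 60 × 2 / 4 = 30 distinct voices in each subset.

```r
# Hypothetical helper (not part of my current code): checks the three rules
# for `subset_id`, a vector giving the subset number of each row of `df`.
check_allocation <- function(df, subset_id, n_subsets = 6,
                             per_subset = 60, per_voice = 4) {
  # Rule 1: every comparison is unique and allocated to one of the subsets
  all_allocated <- !anyNA(subset_id) &&
    all(subset_id %in% seq_len(n_subsets)) &&
    !anyDuplicated(df[["comparison"]])
  # Rule 2: every subset holds exactly `per_subset` comparisons
  sizes_ok <- all(tabulate(subset_id, nbins = n_subsets) == per_subset)
  # Rule 3: each voice appears exactly `per_voice` times in every subset,
  # counting the 'voice 1' and 'voice 2' columns together
  long <- data.frame(
    subset = factor(rep(subset_id, 2), levels = seq_len(n_subsets)),
    voice  = c(as.character(df[["voice 1"]]), as.character(df[["voice 2"]]))
  )
  voices_ok <- all(table(long$subset, long$voice) == per_voice)
  c(all_allocated = all_allocated, sizes_ok = sizes_ok, voices_ok = voices_ok)
}

# Toy data: 3 voices, 6 comparisons, 2 subsets of 3, each voice twice per subset
toy <- data.frame(comparison = 1:6,
                  `voice 1` = c("a", "a", "b", "a", "a", "b"),
                  `voice 2` = c("b", "c", "c", "b", "c", "c"),
                  check.names = FALSE)
good <- check_allocation(toy, c(1, 1, 1, 2, 2, 2), n_subsets = 2,
                         per_subset = 3, per_voice = 2)
bad  <- check_allocation(toy, c(1, 1, 2, 1, 2, 2), n_subsets = 2,
                         per_subset = 3, per_voice = 2)
```

In the toy example, `good` passes all three checks, while `bad` has the right subset sizes but puts voice "a" into subset 1 three times, so only `voices_ok` fails.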
I have been troubleshooting some code for a few days. The version below has now been running for several hours without errors and without finishing, so it seems too computationally demanding, and I’m wondering if there’s a workaround:
# Function to split the data into equal subsets with exactly 60 comparisons in each subset
split_data <- function(stimuli_manipulation, num_subsets) {
  # Shuffle the data
  stimuli_manipulation <- stimuli_manipulation[sample(nrow(stimuli_manipulation)), ]
  # Initialize an empty list to store subsets
  subsets <- vector("list", num_subsets)
  # Initialize counters for each voice
  voice_counts <- table(unlist(stimuli_manipulation[, c("voice 1", "voice 2")]))
  # Initialize an index for subsets
  subset_index <- 1
  # Initialize a counter for the total number of comparisons across subsets
  total_comparisons <- 0
  # Initialize an index for the current comparison
  i <- 1
  # Loop until all subsets have 60 comparisons
  while (total_comparisons < 60 * num_subsets) {
    # Check if the total number of comparisons across subsets reaches 60 * num_subsets
    if (total_comparisons >= 60 * num_subsets) {
      break
    }
    # Get the current comparison
    comparison <- stimuli_manipulation[i, ]
    # Check if there are no missing values in the comparison
    if (!is.na(comparison$`voice 1`) && !is.na(comparison$`voice 2`)) {
      # Check if adding the comparison maintains balance for each voice
      if ((voice_counts[comparison$`voice 1`] + voice_counts[comparison$`voice 2`]) < 4) {
        # Add the comparison to the subset
        subsets[[subset_index]] <- rbind(subsets[[subset_index]], comparison)
        # Update the counter for each voice
        voice_counts[comparison$`voice 1`] <- voice_counts[comparison$`voice 1`] + 1
        voice_counts[comparison$`voice 2`] <- voice_counts[comparison$`voice 2`] + 1
        # Increment the total number of comparisons
        total_comparisons <- total_comparisons + 1
      }
    }
    # Move to the next comparison
    i <- i %% nrow(stimuli_manipulation) + 1
    # Move to the next subset
    subset_index <- subset_index %% num_subsets + 1
  }
  # Save the subsets as data frames in the workspace and as CSV files
  for (i in 1:num_subsets) {
    assign(paste0("subset_", i), subsets[[i]], envir = .GlobalEnv)
    write.csv(subsets[[i]], paste0("subset_", i, ".csv"), row.names = FALSE)
  }
  # Return the list of subsets
  return(subsets)
}
# Call the function to split the data into subsets
subsets <- split_data(stimuli_manipulation, 6)
I also think I need to change it so that ‘voice_counts’ sums across the ‘voice 1’ and ‘voice 2’ columns, so that the count is accurate each time it is updated, but I’m figuring this out on my own and don’t have much experience!
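Roughly what I have in mind for the counting is below. It is a sketch under my own assumptions rather than working allocation code: `assign_greedy` and its arguments are names I made up, the counts start at zero and are kept separately for each subset, and no comparison pairs a voice with itself. A single greedy pass like this can get stuck, in which case some comparisons are left unallocated (`NA`), so it would probably need restarts or a smarter search on top.

```r
# Hypothetical sketch: one greedy pass with per-subset voice counts that
# sum over both the 'voice 1' and 'voice 2' columns.
assign_greedy <- function(df, n_subsets = 6, per_subset = 60, per_voice = 4) {
  v1 <- as.character(df[["voice 1"]])
  v2 <- as.character(df[["voice 2"]])
  voices <- setdiff(unique(c(v1, v2)), NA)
  # One row per subset, one column per voice, all starting at zero
  counts <- matrix(0L, nrow = n_subsets, ncol = length(voices),
                   dimnames = list(NULL, voices))
  sizes <- integer(n_subsets)
  subset_id <- rep(NA_integer_, nrow(df))
  for (i in sample(nrow(df))) {                  # visit rows in random order
    if (is.na(v1[i]) || is.na(v2[i])) next       # skip rows with a missing voice
    # Subsets with room overall and room for both voices of this comparison
    ok <- which(sizes < per_subset &
                counts[, v1[i]] < per_voice &
                counts[, v2[i]] < per_voice)
    if (length(ok) == 0) next                    # stuck: row stays unallocated
    s <- ok[sample.int(length(ok), 1)]           # pick an eligible subset at random
    subset_id[i] <- s
    counts[s, v1[i]] <- counts[s, v1[i]] + 1L    # count the voice wherever it appears
    counts[s, v2[i]] <- counts[s, v2[i]] + 1L
    sizes[s] <- sizes[s] + 1L
  }
  list(subset_id = subset_id, counts = counts)
}

# Toy data: 3 voices, 6 comparisons, 2 subsets of 3, at most 2 per voice
toy <- data.frame(comparison = 1:6,
                  `voice 1` = c("a", "a", "b", "a", "a", "b"),
                  `voice 2` = c("b", "c", "c", "b", "c", "c"),
                  check.names = FALSE)
set.seed(1)
res <- assign_greedy(toy, n_subsets = 2, per_subset = 3, per_voice = 2)
```

Here `res$counts` is the subset-by-voice table I would check against the limit of 4 in the real data, and any `NA` in `res$subset_id` marks a comparison the pass could not place.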
Thanks in advance