I have two data sets. The first is a list of specifications:
x = c("a","b","c")
The second is a data frame where one column is a list of job titles and the second column is a list of specifications included in that job title:
y = c("job1","job2", "job3","job4","job5","job6","job7")
z = c("a","b","c","a, b","a,c","b,c","a, b, c" )
df = data.frame(y,z)
I want to write code that will output all of the different combinations of job titles whose specifications would include everything from the first list. For example: job1, job2, and job3 together would meet the specifications, as would job1 and job6, or just job7.
I can get one specification, but I can't output all of the different combinations that would suffice.
Here is one way to do this:
y = c("job1","job2", "job3","job4","job5","job6","job7")
z = c("a","b","c","a, b","a,c","b,c","a, b, c" )
df = data.frame(y,z)
# Start by making true/false columns for whether or not a job has a/b/c
df$a <- grepl('a',df$z)
df$b <- grepl('b',df$z)
df$c <- grepl('c',df$z)
# Every combination meeting your criteria will include at least one job with A, one with B, and one with C
a_jobs <- df$y[df$a]
b_jobs <- df$y[df$b]
c_jobs <- df$y[df$c]
# Make all combinations that meet your criteria
jobcombs <- expand.grid(a=a_jobs,b=b_jobs,c=c_jobs)
# Turn this into a list of vectors. Sort the vectors, and remove within-vector duplicates so (job4,job4,job3) becomes (job3,job4)
jobvecs <- apply(jobcombs,1,function(x) { return(sort(unique(x))) })
# Remove duplicate vectors from the list - there may be duplicates now that they're sorted and order doesn't make them different
# Eg. (job4,job6,job5) and (job5,job4,job6) are now both (job4,job5,job6) after sorting
jobvecs <- jobvecs[!duplicated(jobvecs)]
# Remove combinations where you don't actually need all three jobs to get to A+B+C, eg. job7/job1/job2
rownames(df) <- df$y # For easy indexing
check_all_necessary <- function(dat,jobs,critcols=c('a','b','c')) {
  # Number of jobs in this combination that cover each requirement
  totals <- colSums(dat[jobs,critcols])
  for(j in jobs) {
    if(min(totals - unlist(dat[j,critcols])) > 0) { # All requirements still met without this job
      return(FALSE)
    }
  }
  return(TRUE)
}
keep <- list()
dontkeep <- list()
for(jvec in jobvecs) {
  if(check_all_necessary(df,jvec)) {
    keep[[length(keep) + 1]] <- jvec
  } else {
    dontkeep[[length(dontkeep) + 1]] <- jvec
  }
}
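The true/false columns above are hard-coded for a, b, and c, and grepl() matches substrings, so a specification named "a" would also match an entry such as "ab". Below is a minimal sketch of one way to build the same columns from the specification vector x in the question, assuming the entries in z are always comma-separated. The helper name spec_flags is made up for this illustration.

```r
x <- c("a", "b", "c")
df <- data.frame(
  y = c("job1", "job2", "job3", "job4", "job5", "job6", "job7"),
  z = c("a", "b", "c", "a, b", "a,c", "b,c", "a, b, c")
)

# Hypothetical helper: one logical column per specification, using exact
# matches on the comma-separated entries rather than substring matches.
spec_flags <- function(z, specs) {
  parts <- strsplit(as.character(z), ",")   # "a, b" becomes c("a", " b")
  parts <- lapply(parts, trimws)            # strip the stray spaces
  flags <- sapply(specs, function(s) sapply(parts, function(p) s %in% p))
  # Force a rows-by-specs matrix even when there is only one row
  matrix(flags, nrow = length(parts), dimnames = list(NULL, specs))
}

flags <- spec_flags(df$z, x)
df <- cbind(df, flags)   # adds logical columns named after the specifications
```

With the columns named after the entries of x, passing critcols = x to check_all_necessary would avoid hard-coding the names there as well.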
And some quick checks of the results:
check_comb <- function(dat,jobs,critcols=c('a','b','c')) {
  return(dat[jobs,critcols])
}
lapply(keep,check_comb,dat=df)
lapply(dontkeep,check_comb,dat=df)
Here are examples of what got discarded in dontkeep because not all jobs in the combination were necessary to meet the a+b+c requirement:
[[1]]
         a     b     c
job2 FALSE  TRUE FALSE
job3 FALSE FALSE  TRUE
job4  TRUE  TRUE FALSE

[[2]]
         a     b     c
job2 FALSE  TRUE FALSE
job3 FALSE FALSE  TRUE
job5  TRUE FALSE  TRUE
In the first combination, job2 is not necessary; you could meet the requirements with job3+job4 only. In the second combination, job3 is not necessary; you could meet the requirements with job2+job5 alone.
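The arithmetic behind that check can be traced by hand for the first discarded combination. The snippet below is only an illustration: the small matrix m is typed in from the output above rather than taken from df, and the name redundant is introduced here.

```r
# Rows copied from the first discarded combination shown above.
m <- rbind(
  job2 = c(a = FALSE, b = TRUE,  c = FALSE),
  job3 = c(a = FALSE, b = FALSE, c = TRUE),
  job4 = c(a = TRUE,  b = TRUE,  c = FALSE)
)

# How many jobs in the combination cover each specification
totals <- colSums(m)

# For each job, subtract its row from the totals. If every count stays above
# zero, the remaining jobs still cover everything and that job is redundant.
redundant <- apply(m, 1, function(row) min(totals - row) > 0)
```

Here redundant is TRUE only for job2, which is why check_all_necessary returns FALSE for this combination.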
Here are all of the final kept combinations, collapsed to pasted strings with sapply(keep,paste,collapse='_'):
"job1_job2_job3" "job3_job4" "job2_job5" "job4_job5" "job5_job6"
"job4_job6" "job1_job6" "job7"