I have a data frame with the following types of column names:
[13] "WTS1_igg1" "WT_RBD_igg1" "WTS2_igg1" "WT_NTD_igg1"
[17] "Alpha_Spike_igg1" "Beta_RBD_igg1" "Beta_Spike_igg1" "Delta.Spike_igg1"
[49] "Beta_Spike_igg3" "Delta.Spike_igg3" "Gamma.Spike_igg3" "Alpha_RBD_igg3"
[165] "IgG1_S2" "IgG1_NTD" "IgG1_N" "IgG1_Alpha_RBD"
[201] "IgG3_Gamma_Spike" "IgG3_Delta_RBD" "IgG3_Delta_Spike" "IgG4_WT_RBD"
There are some common patterns here. For example, line 1 ([13]) has WTS1, RBD, or NTD at the beginning or in the middle of the column names.
Line 2 ([17]) has the Spike and RBD strings in the middle of the column names.
Further down, line 5 ([201]) has the Spike and RBD strings at the end of the column names.
With that context, here is what I'm doing: after transforming the data frame to long format, I'm trying to split it into individual data frames based on the string of interest. Here is my code:
# Packages assumed from the calls below: reshape2 for melt(), dplyr for %>% and filter(), stringr for str_detect()
library(reshape2)
library(dplyr)
library(stringr)

# TRANSFORM TO LONG FORMAT
id.vars.v1 <- c('Tube_patient')
# Measure columns, selected by index (column 12 to the last column)
value.vars <- names(df.infected)[12:ncol(df.infected)]
# Use melt to reshape the df
df.infected.melted <- melt(df.infected,
                           id.vars = id.vars.v1,
                           measure.vars = value.vars,
                           variable.name = 'variable',
                           value.name = 'value')

# PERFORM THE FILTERING PROCESS
# Define a list of each analyte (THESE ARE THE STRINGS OF INTEREST)
keywords_boxplot <- c('Spike', 'WTS1', 'WTS2', 'RBD', 'N', 'NTD')
# Filter the df by each string in the previous list
filtered_dfs <- lapply(keywords_boxplot, function(keyword) {
  # The pattern is the keyword followed by an underscore
  pattern <- paste0(keyword, "_")
  df.infected.melted %>% filter(str_detect(variable, pattern))
})
# Name each filtered df after its keyword
names(filtered_dfs) <- paste0("df.", keywords_boxplot)
# Create each df in the global environment
list2env(filtered_dfs, envir = .GlobalEnv)
The issue is that this code does NOT split the long df into individual ones correctly: it picks up only some of the matching strings, not all of them.
For example, based on the comment at the beginning of my post, one df has only strings of this type:
"WTS1_igg1" "WT_RBD_igg1"
These column names have the string of interest at the beginning. However, after filtering, this df does not contain any other kind of match. For example, names with the string of interest at the end or in the middle are missing:
"IgG3_Gamma_Spike" "IgG3_Delta_RBD" "Beta_RBD_igg1"
In theory, each of the resulting data frames should have at least 750 rows, which is not the case.
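The mismatch can be reproduced on a handful of the column names shown at the top. This is only a minimal sketch: it uses base R's grepl() in place of str_detect() (I'm assuming both treat this simple pattern the same way), and `example_names` is a hand-picked sample, not my real data.

```r
# Hand-picked sample of column names from the output above (not the full data)
example_names <- c("WT_RBD_igg1", "Beta_RBD_igg1", "IgG3_Delta_RBD",
                   "Delta.Spike_igg1", "IgG3_Gamma_Spike", "IgG1_N")

# Same pattern construction as in my lapply() call: keyword followed by "_"
pattern_rbd   <- paste0("RBD", "_")
pattern_spike <- paste0("Spike", "_")

# Names that the current pattern keeps
kept_rbd   <- example_names[grepl(pattern_rbd, example_names)]
kept_spike <- example_names[grepl(pattern_spike, example_names)]

# Names that contain the keyword but are dropped by the current pattern
missed_rbd   <- setdiff(example_names[grepl("RBD", example_names)], kept_rbd)
missed_spike <- setdiff(example_names[grepl("Spike", example_names)], kept_spike)

print(kept_rbd)
print(missed_rbd)
print(missed_spike)
```

On this sample, the names that are dropped are the ones where nothing follows the keyword.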
So the idea is to use the presence of any string from this list:
keywords_boxplot <- c('Spike', 'WTS1', 'WTS2', 'RBD', 'N', 'NTD')
to filter the variable column of my long df (which holds the original column names), no matter where the keyword sits in the name.
What is wrong in my code that makes it detect only one type of string (the one with the keyword at the beginning) but not the ones with the keyword at the end?
How can I take the position of the keyword into account?
Or, even better, how can I avoid the issue with the position of the keyword altogether?
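For what it's worth, one direction I've sketched is to match the keyword as a whole token wherever it appears. This is an assumption-heavy sketch, not something I've verified on my real data: it assumes tokens are always separated by `_` or `.`, that the keywords contain no regex metacharacters, and `matches_keyword` is a made-up helper name, not a function from any package.

```r
# Hypothetical helper (not from any package): TRUE where `keyword` appears as a
# whole token, assuming tokens are separated by "_" or "." as in my column names
matches_keyword <- function(x, keyword) {
  # Keyword must be preceded by start-of-string or a separator,
  # and followed by a separator or end-of-string
  pattern <- paste0("(^|[_.])", keyword, "([_.]|$)")
  grepl(pattern, x)
}

# Hand-picked sample of column names (not the full data)
example_names <- c("WTS1_igg1", "WT_NTD_igg1", "Beta_RBD_igg1",
                   "Delta.Spike_igg1", "IgG1_N", "IgG1_NTD", "IgG3_Delta_RBD")

keywords_boxplot <- c("Spike", "WTS1", "WTS2", "RBD", "N", "NTD")

# One character vector of matching names per keyword
matched <- lapply(keywords_boxplot, function(k) {
  example_names[matches_keyword(example_names, k)]
})
names(matched) <- paste0("df.", keywords_boxplot)
print(matched)
```

In my pipeline this would presumably become `filter(matches_keyword(variable, keyword))` inside the lapply() call. Matching whole tokens is also meant to keep 'N' from matching inside 'NTD', but I'm not sure this is the right approach.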
Any comment, idea, or correction is more than welcome!