I have a dataframe that was read in from a tab-separated file, one column of which was itself semicolon-separated. This column contains most of my variables of interest, but it is not sorted: some rows contain more information than others, and some rows have duplicate values. The variables of interest do, however, contain an identifier as part of their string, e.g. “gene eno”.
For each row, I would like to identify all values that match a given identifier and paste them together, as below:
Current dataframe:
Column A | V9_01 | V9_02 |
---|---|---|
CDS 1 | Index123 | gene “pla” |
CDS 2 | gene “dah” | |
CDS 3 | gene “blah” | Location:456 |
CDS 4 | gene “do” | gene “rah” |
CDS 5 | Index127 | Location893 |
Desired dataframe:
Column A | V9_01 | V9_02 | Gene_Name |
---|---|---|---|
CDS 1 | Index123 | gene “pla” | gene “pla” |
CDS 2 | gene “dah” | | gene “dah” |
CDS 3 | gene “blah” | Location:456 | gene “blah” |
CDS 4 | gene “do” | gene “rah” | gene “do”, gene “rah” |
CDS 5 | Index127 | Location893 | NA |
I have made the current dataframe using the following code to read in the original file:
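library(dplyr)            # provides %>%
library(splitstackshape)  # provides cSplit()
DP_GTF <- read.delim("E:/Genome_Files/GTF/DolosPig51524.gtf", sep = "\t", comment.char = "#", header = FALSE) %>%
  subset(V3 == "CDS") %>%   # keep only the CDS features
  #select(c("V9")) %>%
  cSplit("V9", ";")         # split the semicolon-separated column into V9_01, V9_02, ...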
I’m not sure how to get my desired dataframe, but I assume I need to run grep over part of the dataframe?
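As a possible starting point, here is a minimal sketch of the row-wise grep idea, run on a hand-typed copy of the example table rather than the real file. It makes several assumptions: the split columns all start with `V9_`, the identifier sits at the start of each value (`^gene `), the quotes in the real data are straight quotes, and the object is a plain data.frame (`cSplit()` returns a data.table, so `as.data.frame()` would probably be needed first). The helper name `collect_matches` is made up for illustration.

```r
# Toy copy of the "Current dataframe" table (values typed in by hand)
df <- data.frame(
  Column_A = paste("CDS", 1:5),
  V9_01 = c("Index123", 'gene "dah"', 'gene "blah"', 'gene "do"', "Index127"),
  V9_02 = c('gene "pla"', NA, "Location:456", 'gene "rah"', "Location893"),
  stringsAsFactors = FALSE
)

# Columns produced by the split; assumed to share the "V9_" prefix
v9_cols <- grep("^V9_", names(df), value = TRUE)

# Hypothetical helper: for one row, keep the unique non-missing values that
# match the identifier and paste them together; NA when nothing matches
collect_matches <- function(row, pattern) {
  hits <- unique(row[!is.na(row) & grepl(pattern, row)])
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
}

# Apply the helper across rows, looking only at the V9_ columns
df$Gene_Name <- apply(df[v9_cols], 1, collect_matches, pattern = "^gene ")
```

On the toy data this gives one `Gene_Name` per row, with CDS 4 holding both gene values separated by a comma and CDS 5 left as `NA`. The same helper could be reused with a different `pattern` for other identifiers, assuming they also appear at the start of the value.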