I am doing some genetics work and I am trying to use two datasets, but each one refers to the genes differently. One dataset (let's call it Column A) refers to each gene by its conventional name together with a specific ID. The other dataset (Column B) uses the name, the ID, or both. I’d rather not rerun the pipeline to get the names to align, so I was hoping I could match and copy values as per the title. To reiterate: for each value in Column B, I want to look in Column A for the “best match” and set the Column B value to that match.
Below is an example of the columns and my issue:
Column A | Column B |
---|---|
gene1-ID001 | gene1 |
gene2-ID002 | ID002 |
gene3-ID003 | gene3-ID003 |
gene4-ID004 | gene4 |
Happy to learn about any established tools for these kinds of issues as well. Thank you.
I’m fairly certain I need to use grep in some manner to check for partial matches before setting values, but I lack the finesse to set this up. In particular, I don't know how to handle cases where there is more than one partial match and pick the “most matching” one.
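
To make the "most matching" idea concrete, here is a rough Python sketch of one possible ranking. It is only an illustration under assumptions that are not stated above: every Column A value has the layout `name-ID` with the ID after the last hyphen, and an exact full match should beat an exact name or ID match, which in turn beats a loose substring hit. The function name `best_match` and the scores are hypothetical.

```python
def best_match(query, candidates):
    """Return the single best-matching candidate, or None if there is
    no match or the best score is shared by several candidates."""
    scored = []
    for cand in candidates:
        # Assumption: "name-ID" layout, ID after the last hyphen.
        tokens = cand.rsplit("-", 1)
        if query == cand:
            score = 3  # identical string
        elif query in tokens:
            score = 2  # exact gene name or exact ID
        elif query in cand:
            score = 1  # loose substring hit (like a plain grep)
        else:
            continue
        scored.append((score, cand))

    if not scored:
        return None
    top = max(score for score, _ in scored)
    winners = [cand for score, cand in scored if score == top]
    # Ambiguous ties are left unresolved rather than guessed.
    return winners[0] if len(winners) == 1 else None


column_a = ["gene1-ID001", "gene2-ID002", "gene3-ID003", "gene4-ID004"]
column_b = ["gene1", "ID002", "gene3-ID003", "gene4"]
matched = [best_match(b, column_a) for b in column_b]
print(matched)
```

Ranking exact name or ID matches above substring hits is what keeps `gene1` from also matching something like `gene10-ID010`, which a plain substring search would do. Returning `None` on a tie leaves ambiguous rows visible for manual review.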