I want to compare two tab-separated files (with headers) and check whether column 6 of the first file (peptide.tsv) matches exactly with column 2 of the second file (proteome.tsv). If there is an exact match, print Fail, otherwise print Pass. Also col6 may not be unique. There can be identical sequences with different locus.
==> test_peptide.tsv <==
FusionID Peptide HLA_Allele EL-score EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNOVEL*PEPTIDE
==> test_proteome.tsv <==
Name AA_seq
sp|A0A0G2JMI3|HV692_HUMAN Immunoglobulin heavy variable 1-69-2 OS=Homo sapiens OX=9606 GN=IGHV1-69-2 PE=1 SV=2 MDCTWRILLLVAAATGTHAEVQLVQSGAEVKKPGATVKISCKVSGYTFTDYYMHWVQQAPGKGLEWMGLVDPEDGETIYAEKFQGRVTITADTSTDTAYMELSSLRSEDTAVYYCA
rp|A0A0G2JMI3|HV692_HUMAN Immunoglobulin heavy variable 1-69-2 OS=Homo sapiens OX=9606 GN=IGHV1-69-2 PE=1 SV=6 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPPT
What I am getting:
FusionIDPass Peptide HLA_Allele EL-score EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRPassAPWKVMFVVWSLLAQETQ*SRETCRRPP
TUFT1--PPass1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNOVEL*PEPTIDE
Tried these
awk -F\t 'NR==FNR {a[$2];next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv
awk 'NR==FNR {proteome[$2]; next} {status = ($6 in proteome ? "Fail" : "Pass"); print $0 "\t" status}' test_proteome.tsv test_peptide.tsv
expected_output.tsv
FusionID Peptide HLA_Allele EL-score EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP Fail
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNOVEL*PEPTIDE Pass
2
Since OP has posted the (incorrect) output generated by their code I’m going to assume the awk
scripts they’ve posted are not their real code otherwise the escaped brackets would be generating syntax errors, g:
$ awk -F\t 'NR==FNR {a[$2];next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv
awk: cmd. line:1: NR==FNR {a[$2];next} {print $0 FS (($6) in a?"Fail":"Pass")}
awk: cmd. line:1: ^ backslash not last character on line
awk: cmd. line:1: NR==FNR {a[$2];next} {print $0 FS (($6) in a?"Fail":"Pass")}
awk: cmd. line:1: ^ syntax error
Also, to designate a tab as the field delimiter we need to quote the t
.
Until OP updates the question with the actual code I’m going to assume we’re starting with the following:
awk -F't' 'NR==FNR {a[$2];next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv
awk 'NR==FNR {proteome[$2]; next} {status = ($6 in proteome ? "Fail" : "Pass"); print $0 "t" status}' test_proteome.tsv test_peptide.tsv
The second script will fail to generate the desired results because awk
will use the default field separator of white space (ie, all spaces and tabs will be processed as field separators). This means the values in the column 2 of test_proteome.tsv
are AA_seq
, Immunoglobulin
and Immunoglobulin
.
If we update the second script to awk -F't' 'NR==FNR ...'
we find that both scripts now generate:
FusionID Peptide HLA_Allele EL-score EL_Rank AA_seq_full Pass
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP Pass
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNOVEL*PEPTIDE Pass
A couple issues:
- header line has the string
Pass
appended on the end (OP hasn’t coded anything to handle the header so the ternary operation applies here, too) - the first data line ends with
Pass
instead ofFail
Looking at the data we see there are no matches in the given input data; of particular interest are the strings that start with MNGTRN
:
test_proteome.tsv : MNGTRN ... RRPPT
test_peptide.tsv : MNGTRN ... RRPP
Notice one has a trailing T
while the other is missing the trailing T
. Since OP has stated we are looking for an exact match, and we have no exact matches, this associated line is appended with Pass
.
Since OP’s expected output does not display the trailing T
I’m going to assume this was a typo. I’ve removed the trailing T
from my copy of test_proteome.tsv
.
At this point both of OP’s scripts generate the desired output … except for that Pass
in the header line:
FusionID Peptide HLA_Allele EL-score EL_Rank AA_seq_full Pass
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP Fail
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNOVEL*PEPTIDE Pass
Modifying OP’s 1st script to handle the header line:
########
# do not add to the header line:
$ awk -F't' 'NR==FNR {a[$2];next} FNR==1 {print $0;next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv
FusionID Peptide HLA_Allele EL-score EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP Fail
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNOVEL*PEPTIDE Pass
########
# add 'Result' on end of header line:
$ awk -F't' 'NR==FNR {a[$2];next} FNR==1 {print $0 FS "Result";next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv
FusionID Peptide HLA_Allele EL-score EL_Rank AA_seq_full Result
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPPT Fail
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNOVEL*PEPTIDE Pass
NOTE: I’m assuming the next-to-last field of the last line in OP’s expected output is a typo and the asterisk is not supposed to be escaped, ie, MNOVEL*PEPTIDE
should be MNOVEL*PEPTIDE
5
An example using a bash script in a didactic way, not compressed in a one liner awk command
#!/bin/bash
# Input files
file1="test_peptide.tsv"
file2="test_proteome.tsv"
# Output file
output_file="output.tsv"
# Column numbers to compare
column1=6
column2=2
#write the headers to the output file
printf '%stResultn' "$(head -n 1 $file1)" > $output_file
#for row 2 to the number of lines in the file1
for i in $(seq 2 "$(cat $file1 | wc -l)")
do
if [ "$(awk -F 't' -v i="$i" -v col1="$column1" 'NR==i {print $col1}' $file1)"
== "$(awk -F 't' -v i="$i" -v col2="$column2" 'NR==i {print $col2}' $file2)" ]
then
printf '%stFailn' "$(awk -v i="$i" 'NR==i {print $0}' $file1)" >> $output_file
else
printf '%stPassn' "$(awk -v i="$i" 'NR==i {print $0}' $file1)" >> $output_file
fi
done
13