How to compare two tab-separated files and mark ‘Fail’ if col6 of file1 matches col2 of file2, else ‘Pass’? [closed]

I want to compare two tab-separated files (with headers) and check whether column 6 of the first file (peptide.tsv) matches exactly with column 2 of the second file (proteome.tsv). If there is an exact match, print Fail, otherwise print Pass. Also col6 may not be unique. There can be identical sequences with different locus.

==> test_peptide.tsv <==
FusionID        Peptide HLA_Allele      EL-score        EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNOVEL*PEPTIDE

==> test_proteome.tsv <==
Name    AA_seq
sp|A0A0G2JMI3|HV692_HUMAN Immunoglobulin heavy variable 1-69-2 OS=Homo sapiens OX=9606 GN=IGHV1-69-2 PE=1 SV=2  MDCTWRILLLVAAATGTHAEVQLVQSGAEVKKPGATVKISCKVSGYTFTDYYMHWVQQAPGKGLEWMGLVDPEDGETIYAEKFQGRVTITADTSTDTAYMELSSLRSEDTAVYYCA
rp|A0A0G2JMI3|HV692_HUMAN Immunoglobulin heavy variable 1-69-2 OS=Homo sapiens OX=9606 GN=IGHV1-69-2 PE=1 SV=6  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPPT

What I am getting:

FusionIDPass    Peptide HLA_Allele      EL-score        EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRPassAPWKVMFVVWSLLAQETQ*SRETCRRPP
TUFT1--PPass1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNOVEL*PEPTIDE

Tried these

awk -F\t 'NR==FNR {a[$2];next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv

awk 'NR==FNR {proteome[$2]; next} {status = ($6 in proteome ? "Fail" : "Pass"); print $0 "\t" status}' test_proteome.tsv test_peptide.tsv

expected_output.tsv

FusionID        Peptide HLA_Allele      EL-score        EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP Fail
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNOVEL*PEPTIDE Pass

Since OP has posted the (incorrect) output generated by their code I’m going to assume the awk scripts they’ve posted are not their real code otherwise the escaped brackets would be generating syntax errors, g:

$ awk -F\t 'NR==FNR {a[$2];next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv
awk: cmd. line:1: NR==FNR {a[$2];next} {print $0 FS (($6) in a?"Fail":"Pass")}
awk: cmd. line:1:           ^ backslash not last character on line
awk: cmd. line:1: NR==FNR {a[$2];next} {print $0 FS (($6) in a?"Fail":"Pass")}
awk: cmd. line:1:           ^ syntax error

Also, to designate a tab as the field delimiter we need to quote the t.

Until OP updates the question with the actual code I’m going to assume we’re starting with the following:

awk -F't' 'NR==FNR {a[$2];next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv

awk 'NR==FNR {proteome[$2]; next} {status = ($6 in proteome ? "Fail" : "Pass"); print $0 "t" status}' test_proteome.tsv test_peptide.tsv

The second script will fail to generate the desired results because awk will use the default field separator of white space (ie, all spaces and tabs will be processed as field separators). This means the values in the column 2 of test_proteome.tsv are AA_seq, Immunoglobulin and Immunoglobulin.

If we update the second script to awk -F't' 'NR==FNR ...' we find that both scripts now generate:

FusionID    Peptide HLA_Allele  EL-score    EL_Rank AA_seq_full Pass
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT  GSDGCWGLY   HLA-A01:01  0.79990000000000006 0.1002  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP  Pass
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT  GSDGCWGLY   HLA-A01:01  0.79990000000000006 0.1002  MNOVEL*PEPTIDE  Pass

A couple issues:

header line has the string Pass appended on the end (OP hasn’t coded anything to handle the header so the ternary operation applies here, too)
the first data line ends with Pass instead of Fail

Looking at the data we see there are no matches in the given input data; of particular interest are the strings that start with MNGTRN:

test_proteome.tsv : MNGTRN ... RRPPT
test_peptide.tsv  : MNGTRN ... RRPP

Notice one has a trailing T while the other is missing the trailing T. Since OP has stated we are looking for an exact match, and we have no exact matches, this associated line is appended with Pass.

Since OP’s expected output does not display the trailing T I’m going to assume this was a typo. I’ve removed the trailing T from my copy of test_proteome.tsv.

At this point both of OP’s scripts generate the desired output … except for that Pass in the header line:

FusionID    Peptide HLA_Allele  EL-score    EL_Rank AA_seq_full Pass
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT  GSDGCWGLY   HLA-A01:01  0.79990000000000006 0.1002  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP  Fail
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT  GSDGCWGLY   HLA-A01:01  0.79990000000000006 0.1002  MNOVEL*PEPTIDE  Pass

Modifying OP’s 1st script to handle the header line:

########
# do not add to the header line:

$ awk -F't' 'NR==FNR {a[$2];next} FNR==1 {print $0;next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv
FusionID    Peptide HLA_Allele  EL-score    EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT  GSDGCWGLY   HLA-A01:01  0.79990000000000006 0.1002  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP  Fail
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT  GSDGCWGLY   HLA-A01:01  0.79990000000000006 0.1002  MNOVEL*PEPTIDE  Pass


########
# add 'Result' on end of header line:

$ awk -F't' 'NR==FNR {a[$2];next} FNR==1 {print $0 FS "Result";next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv
FusionID    Peptide HLA_Allele  EL-score    EL_Rank AA_seq_full Result
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT  GSDGCWGLY   HLA-A01:01  0.79990000000000006 0.1002  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPPT Fail
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT  GSDGCWGLY   HLA-A01:01  0.79990000000000006 0.1002  MNOVEL*PEPTIDE  Pass

NOTE: I’m assuming the next-to-last field of the last line in OP’s expected output is a typo and the asterisk is not supposed to be escaped, ie, MNOVEL*PEPTIDE should be MNOVEL*PEPTIDE

An example using a bash script in a didactic way, not compressed in a one liner awk command

#!/bin/bash

# Input files
file1="test_peptide.tsv"
file2="test_proteome.tsv"

# Output file
output_file="output.tsv"

# Column numbers to compare
column1=6
column2=2

#write the headers to the output file
printf '%stResultn' "$(head -n 1 $file1)" > $output_file

#for row 2 to the number of lines in the file1
for i in $(seq 2 "$(cat $file1 | wc -l)")
do
    if [ "$(awk -F 't' -v i="$i" -v col1="$column1" 'NR==i {print $col1}' $file1)" 
    == "$(awk -F 't' -v i="$i" -v col2="$column2" 'NR==i {print $col2}' $file2)" ]
    then
        printf '%stFailn' "$(awk -v i="$i" 'NR==i {print $0}' $file1)" >> $output_file
    else
        printf '%stPassn' "$(awk -v i="$i" 'NR==i {print $0}' $file1)" >> $output_file
    fi
done

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: Kiến thức lập trình - @ 19:14

Thẻ: linuxbashawkbioinformatics

Thiết kế website giá rẻ

Danh mục

How to compare two tab-separated files and mark ‘Fail’ if col6 of file1 matches col2 of file2, else ‘Pass’? [closed]