I am trying to extract specific columns from a .bam file. The columns are tagged with the prefixes “TX:Z:” and “GX:Z:”. I have seen people using perl
to get specific columns; however, I don’t understand the logic well enough to adapt it to my case. I can’t just save the output and subset it in R because the file is too large.
Alternatively, if there is a samtools solution, that would also be great. I am trying to extract the accession numbers. The BAM files were generated from scRNA-seq libraries with Cell Ranger (10x).
Here are 3 lines:
A01604:525:HKHMKDRX3:2:2178:19253:32612 16 NC_030416.2 10591 255 47M2I41M * 0 0 TGATAAATACACAGTTCATTCCTCATACACAAAGACAAAATAAAACACAGAGTATTACAGACGACAAAAGAGAAGGAAGATGGAGATGTG FFFFFFFFFFFFFFFF:FFFF,FFF:FFFFFF::FFFF:FFF:F:F:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:80 nM:i:0 RG:Z:SP1L_IGM_plus:0:1:HKHMKDRX3:2 TX:Z:XR_008397195.1,+3112,41M2I47M GX:Z:khdc4 GN:Z:khdc4 fx:Z:khdc4 RE:A:E xf:i:25 CR:Z:TTGACCCCCATCGTTG CY:Z:FFFFFFFFFFFFF:FF CB:Z:TTGACCCCAATCGTTG-1 UR:Z:ACCCTAACCCCA UY:Z:FFFFFFFFFFFF UB:Z:ACCCTAACCCCA
A01604:525:HKHMKDRX3:1:2144:31973:6167 16 NC_030416.2 10800 255 90M * 0 0 ACAGACAACGATGCATCCAGATGTCTGTTAGGGAATTAAGAGTTTTAATATTATTTCTTGTAGCTTCCAAAGTCTTTACTCTGGCTGGTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:84 nM:i:2 RG:Z:SP1L_IGM_plus:0:1:HKHMKDRX3:1 TX:Z:XR_008397195.1,+2901,90M GX:Z:khdc4 GN:Z:khdc4 fx:Z:khdc4 RE:A:E xf:i:25 CR:Z:ACCTGAATCGCCAATA CY:Z:FFFFFFFFFFFFFFFF CB:Z:ACCTGAATCGCCAATA-1 UR:Z:AGGGCACTGTCA UY:Z:FFFFFFFFFFFF UB:Z:AGGGCACTGTCA
A01604:525:HKHMKDRX3:2:2163:18810:10958 1040 NC_030416.2 10839 255 73M17S * 0 0 GAGTTTTAATATTATTTCTTGTAGCTTCCAAAGTCTTTACTCTGGCTGGTGTTATGAGGTAAAATAGGGGATTAGGTGTAAGTATTACAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF NH:i:1 HI:i:1 AS:i:69 nM:i:1 RG:Z:SP1L_IGM_plus:0:1:HKHMKDRX3:2 TX:Z:XR_008397195.1,+2879,17S73M GX:Z:khdc4 GN:Z:khdc4 fx:Z:khdc4 RE:A:E xf:i:17 CR:Z:ACCTGAATCGCCAATA CY:Z:FFFFFFFFFFFF:FFF CB:Z:ACCTGAATCGCCAATA-1 UR:Z:AGGGCACTGTCA UY:Z:FFFFFFFFFFFF UB:Z:AGGGCACTGTCA
I tried selecting the specific columns using awk,
but it seems that some rows contain an extra tab, meaning an extra column is added to these rows.
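To illustrate what I am after: since the tags are optional fields, their column position is probably not fixed from read to read, so I assume selecting by the tag prefix rather than by column number would avoid the shifted-column problem. Below is a minimal Python sketch of that idea, not a tested solution. It assumes the input is the tab-delimited text printed by `samtools view`; the function name `extract_tags` and the script name `extract_tags.py` are made up for illustration.

```python
import sys

# Tag prefixes to pull out, taken from the question.
WANTED = ("TX:Z:", "GX:Z:")


def extract_tags(sam_line, wanted=WANTED):
    """Return the values of the wanted tags, in the order given by `wanted`.

    A tag that is missing from the line yields an empty string, so every
    output row has the same number of columns.
    """
    fields = sam_line.rstrip("\n").split("\t")
    found = {}
    # The first 11 SAM columns are mandatory; optional tags start at column 12
    # and may appear in any order or be absent.
    for field in fields[11:]:
        for prefix in wanted:
            if field.startswith(prefix):
                found[prefix] = field[len(prefix):]
    return [found.get(prefix, "") for prefix in wanted]


def main(stream=sys.stdin, out=sys.stdout):
    for line in stream:
        if line.startswith("@"):
            continue  # skip header lines, in case they were printed
        read_name = line.split("\t", 1)[0]
        out.write("\t".join([read_name] + extract_tags(line)) + "\n")


# Only process input when something is piped in.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

It would be run as `samtools view file.bam | python extract_tags.py > tags.tsv`, which streams the file line by line instead of loading it all into memory. In the three lines above, the accession looks like the part of the TX value before the first comma, so `value.split(",")[0]` would presumably isolate it.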