I need to get a count of how many unique BLAST hits I’ve obtained for my list of OTUs/ASVs, but in what appear to be random cases, lines that are identical are not being added up properly. Other times, however, the code works fine and sums identical lines with no problem. I have already run dos2unix on my files, I’m running the command on Linux, and I don’t know what else to check for.
Here is a reproducible example:
Input file called check_text.csv with lines below:
ASV Genus_Species Taxid
cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
The code I run:
dos2unix check_text.csv
dos2unix: converting file check_text.csv to Unix format...
cat check_text.csv | uniq -c > hitcounts.csv
But here is the output I keep getting in my hitcounts.csv file:
1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
1 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
The first line is fine; those are just the headers, so that is not a problem. But I should have seen the following:
1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
The line with Moscharia solbrigii comes up twice, as if those lines were not identical, but they are. What could be the problem? I manually checked some of these lines for extra whitespace but there isn’t any.
Note the uniq man page:
‘uniq’ does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use ‘sort -u’ without
‘uniq’
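So the simplest fix is to make identical lines adjacent before counting them. A minimal sketch using the file name from the question; note that this loses the original line order and lets the header row sort in with the data, which is what the solutions below avoid:

sort check_text.csv | uniq -c > hitcounts.csv

Or, keeping the header on the first line (a hypothetical variant, assuming the first line is the only header):

{ head -n 1 check_text.csv; tail -n +2 check_text.csv | sort; } | uniq -c > hitcounts.csv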
You can use Ruby with CSV parser and generator:
ruby -r csv -e '
  opts = {:headers => true, :col_sep => "\t"}
  puts CSV.generate(**opts) { |csv|
    tbl = CSV.parse($<.read, **opts)                      # read the whole TSV
    tally = tbl.map { |row| row.to_a.map(&:last) }.tally  # count identical rows
    csv << ["1"] + tbl.to_a[0]                            # re-emit the header, prefixed with a count of 1
    tally.each { |x, cnt| csv << [cnt] + x }              # prepend each count to its row
  }
' file.tsv
You can also use two awks and a sort:
awk 'BEGIN{FS=OFS="\t"}
     FNR==1{print 1, $0; exit}' file.tsv   # print the header with a count of 1

awk 'BEGIN{FS=OFS="\t"}
     FNR==1{next}                          # skip the header
     {cnt[$0]++}                           # count identical lines
     END{for (e in cnt) print cnt[e], e}
' file.tsv | sort -k1,1rn                  # sort by decreasing count
Either prints:
# These are ordered by header then decreasing frequency
1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
Or, if you want to print them in the file order encountered rather than by value:
ruby -lne 'BEGIN{cnt = Hash.new(0)}
  if $. == 1 then
    puts "cnt\t#{$_}"        # header line: prepend a "cnt" column
  else
    cnt[$_] += 1             # count identical lines (the hash keeps insertion order)
  end
  END{ cnt.each{|k, v| puts "#{v}\t#{k}"} }
' file.tsv
Or awk:
awk 'BEGIN{FS=OFS="\t"}
     FNR==1{print "cnt", $0; next}        # header line: prepend a "cnt" column
     !seen[$0]++{order[++count] = $0}     # remember each unique line in encounter order
     END{
       for (i=1; i<=count; i++)
         print seen[order[i]], order[i]   # seen[] holds the count for each line
     }' file.tsv
Or a two-pass awk (for massive files with lots of duplicates):
awk 'BEGIN{FS=OFS="\t"}
     FNR==1{if (NR==1) print "cnt", $0; next}  # print the header once, then skip it on both passes
     FNR==NR{cnt[$0]++; next}                  # first pass: count identical lines
     !seen[$0]++{print cnt[$0], $0}            # second pass: print each line once, with its count
' file.tsv file.tsv
All three of those print:
# ordered by when line first encountered - not by count.
cnt ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
As others have already mentioned, uniq works on groups of contiguous duplicates, not duplicates across the whole of the input.
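For example, a minimal illustration of that behaviour (the exact width of the count column may vary between uniq implementations):

$ printf 'a\nb\na\n' | uniq -c
      1 a
      1 b
      1 a
$ printf 'a\nb\na\n' | sort | uniq -c
      2 a
      1 b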
Using any awk with a 2-pass approach:
$ awk '
NR==FNR {
cnt[$0]++
next
}
$0 in cnt {
print cnt[$0], $0
delete cnt[$0]
}
' file file
1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
or 1-pass but storing the input in memory:
$ awk '
{
cnt[$0]++
ordr[NR] = $0
}
END {
for ( i=1; i<=NR; i++ ) {
rec = ordr[i]
if ( rec in cnt ) {
print cnt[rec], rec
delete cnt[rec]
}
}
}
' file
1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
I see that @dawg has similar scripts in their answer and I think the differences are interesting and subtle enough to be worth discussing. Here are @dawg’s scripts written using the same style and variable names as I’m using above where applicable, to make the similarities and differences most obvious, and with a tweak where I think it’s useful (IMO we should not use an array named seen[] to store and use/print counts, just to test whether its index has previously been seen or not):
Ed 1-pass:
awk '
{
cnt[$0]++
ordr[NR] = $0
}
END {
for ( i=1; i<=NR; i++ ) {
rec = ordr[i]
if ( rec in cnt ) {
print cnt[rec], rec
delete cnt[rec]
}
}
}
' file
dawg 1-pass:
awk '
!cnt[$0]++ {
ordr[++numRecs] = $0
}
END {
for ( i=1; i<=numRecs; i++) {
rec = ordr[i]
print cnt[rec], rec
}
}
' file
Comparison: dawg only stores the order for each unique $0 and only loops through the number of unique records, whereas I store $0 for every line, loop through every line, and delete non-unique records. They do one extra increment and comparison per line while loading the arrays, but that’s probably negligible overhead. Given that, dawg’s script will use less memory and probably be more efficient than mine, so it is the better approach.
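If you want to check the memory difference on your own data, a rough sketch is below. It assumes GNU time is installed at /usr/bin/time (its -v report includes the maximum resident set size), and ed_1pass.awk / dawg_1pass.awk are hypothetical files holding the two 1-pass programs shown above:

/usr/bin/time -v awk -f ed_1pass.awk file.tsv > /dev/null 2> ed_1pass.mem
/usr/bin/time -v awk -f dawg_1pass.awk file.tsv > /dev/null 2> dawg_1pass.mem
grep 'Maximum resident set size' ed_1pass.mem dawg_1pass.mem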
Ed 2-pass:
awk '
NR==FNR {
cnt[$0]++
next
}
$0 in cnt {
print cnt[$0], $0
delete cnt[$0]
}
' file file
dawg 2-pass:
awk '
NR==FNR {
cnt[$0]++
next
}
!seen[$0]++ {
print cnt[$0], $0
}
' file file
Comparison: we both store the record counts the same way on the first pass, but on the second pass dawg creates an array seen[] of all of the unique records from the input, so they essentially have two copies of the input stored in memory (which means dawg’s 2-pass script isn’t actually using any less memory than their 1-pass script), whereas my approach keeps only one copy of the input in memory. dawg does a hash lookup plus arithmetic while I do a hash lookup plus a deletion; in terms of execution speed it’s probably not different enough either way to matter. Since we’d only adopt this 2-pass approach to avoid memory issues when the input could be huge (otherwise we’d use the 1-pass approach), my approach is better, as it uses half the memory of both 1-pass scripts and of dawg’s 2-pass script.
A decorate-sort-undecorate approach that doesn’t store a lookup table (so it should work with files that don’t fit in memory; sort uses temp files for large input):
awk '{print NR"\t"$0}' input.csv |
sort -k2 -k1n,1 |
uniq -c -f1 |
sort -k2n,2 |
sed 's/[0-9]*\t//'
- prepend line numbers
- sort ignoring the line number, then reorder identical lines by the line number
- count the duplicates, ignoring the line number
- re-sort to original line order
- strip the line numbers
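Run against the sample data from the question (saved as input.csv), this should produce the counts in the original encounter order, something like the following (uniq -c’s leading whitespace is trimmed here for readability):

1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062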