I need to get a count of how many unique BLAST hits I’ve obtained for my list of OTUs/ASVs, but in what appear to be random cases, lines that are identical are not being added up properly. Other times, however, the code works fine and sums identical lines with no problem. I have already run dos2unix on my files, I’m running the command on Linux, and I don’t know what else to check for.
Here is a reproducible example:
Input file called check_text.csv with lines below:
ASV Genus_Species Taxid
cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
The code I run:
dos2unix check_text.csv
dos2unix: converting file check_text.csv to Unix format...
cat check_text.csv | uniq -c > hitcounts.csv
But here is the output I keep getting in my hitcounts.csv file:
1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
1 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
The first line is fine; those are just the headers, so that is not a problem. But I should have seen the following:
1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
The line with Moscharia solbrigii comes up twice, as if those lines were not identical, but they are. What could be the problem? I manually checked some of these lines for extra whitespace but there isn’t any.
Note the uniq man page:
‘uniq’ does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use ‘sort -u’ without
‘uniq’
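So the simplest fix is to make identical lines adjacent before counting them. A minimal sketch using the file name from the question; note that this loses the original line order and lets the header row sort in with the data, which is what the solutions below avoid:

sort check_text.csv | uniq -c > hitcounts.csv

Or, keeping the header on the first line (a hypothetical variant, assuming the first line is the only header):

{ head -n 1 check_text.csv; tail -n +2 check_text.csv | sort; } | uniq -c > hitcounts.csv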
You can use Ruby with CSV parser and generator:
ruby -r csv -e '
  opts = {:headers => true, :col_sep => "\t"}
  puts CSV.generate(**opts) { |csv|
    tbl = CSV.parse($<.read, **opts)                      # read the whole TSV
    tally = tbl.map { |row| row.to_a.map(&:last) }.tally  # count identical rows
    csv << ["1"] + tbl.to_a[0]                            # re-emit the header, prefixed with a count of 1
    tally.each { |x, cnt| csv << [cnt] + x }              # prepend each count to its row
  }
' file.tsv
You can also use two awks and a sort:
awk 'BEGIN{FS=OFS="\t"}
     FNR==1{print 1, $0; exit}' file.tsv   # print the header with a count of 1

awk 'BEGIN{FS=OFS="\t"}
     FNR==1{next}                          # skip the header
     {cnt[$0]++}                           # count identical lines
     END{for (e in cnt) print cnt[e], e}
' file.tsv | sort -k1,1rn                  # sort by decreasing count
Either prints:
# These are ordered by header then decreasing frequency
1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
Or, if you want to print them in the file order encountered rather than by value:
ruby -lne 'BEGIN{cnt = Hash.new(0)}
  if $. == 1 then
    puts "cnt\t#{$_}"        # header line: prepend a "cnt" column
  else
    cnt[$_] += 1             # count identical lines (the hash keeps insertion order)
  end
  END{ cnt.each{|k, v| puts "#{v}\t#{k}"} }
' file.tsv
Or awk:
awk 'BEGIN{FS=OFS="\t"}
     FNR==1{print "cnt", $0; next}        # header line: prepend a "cnt" column
     !seen[$0]++{order[++count] = $0}     # remember each unique line in encounter order
     END{
       for (i=1; i<=count; i++)
         print seen[order[i]], order[i]   # seen[] holds the count for each line
     }' file.tsv
Or a two-pass awk (for massive files with lots of duplicates):
awk 'BEGIN{FS=OFS="\t"}
     FNR==1{if (NR==1) print "cnt", $0; next}  # print the header once, then skip it on both passes
     FNR==NR{cnt[$0]++; next}                  # first pass: count identical lines
     !seen[$0]++{print cnt[$0], $0}            # second pass: print each line once, with its count
' file.tsv file.tsv
All three of those print:
# ordered by when line first encountered - not by count.
cnt ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
As others have already mentioned, uniq works on groups of contiguous duplicates, not duplicates across the whole of the input.
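For example, a minimal illustration of that behaviour (the exact width of the count column may vary between uniq implementations):

$ printf 'a\nb\na\n' | uniq -c
      1 a
      1 b
      1 a
$ printf 'a\nb\na\n' | sort | uniq -c
      2 a
      1 b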
Using any awk with a 2-pass approach:
$ awk '
NR==FNR {
cnt[$0]++
next
}
$0 in cnt {
print cnt[$0], $0
delete cnt[$0]
}
' file file
1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
or 1-pass but storing the input in memory:
$ awk '
{
cnt[$0]++
ordr[NR] = $0
}
END {
for ( i=1; i<=NR; i++ ) {
rec = ordr[i]
if ( rec in cnt ) {
print cnt[rec], rec
delete cnt[rec]
}
}
}
' file
1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062
I see that @dawg has similar scripts in their answer and I think the differences are interesting and subtle enough to be worth discussing. Here are @dawg’s scripts written using the same style and variable names as I’m using above where applicable, to make the similarities and differences most obvious, and with a tweak where I think it’s useful (IMO we should not use an array named seen[] to store and use/print counts, just to test whether its index has previously been seen or not):
Ed 1-pass:
awk '
{
cnt[$0]++
ordr[NR] = $0
}
END {
for ( i=1; i<=NR; i++ ) {
rec = ordr[i]
if ( rec in cnt ) {
print cnt[rec], rec
delete cnt[rec]
}
}
}
' file
dawg 1-pass:
awk '
!cnt[$0]++ {
ordr[++numRecs] = $0
}
END {
for ( i=1; i<=numRecs; i++) {
rec = ordr[i]
print cnt[rec], rec
}
}
' file
Comparison: dawg only stores the order for each unique $0 and only loops through the number of unique records, whereas I store $0 for every line, loop through every line, and delete non-unique records. They do one extra increment and comparison per line while loading the arrays, but that’s probably negligible overhead. Given that, dawg’s script will use less memory and probably be more efficient than mine, so it is the better approach.
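If you want to check the memory difference on your own data, a rough sketch is below. It assumes GNU time is installed at /usr/bin/time (its -v report includes the maximum resident set size), and ed_1pass.awk / dawg_1pass.awk are hypothetical files holding the two 1-pass programs shown above:

/usr/bin/time -v awk -f ed_1pass.awk file.tsv > /dev/null 2> ed_1pass.mem
/usr/bin/time -v awk -f dawg_1pass.awk file.tsv > /dev/null 2> dawg_1pass.mem
grep 'Maximum resident set size' ed_1pass.mem dawg_1pass.mem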
Ed 2-pass:
awk '
NR==FNR {
cnt[$0]++
next
}
$0 in cnt {
print cnt[$0], $0
delete cnt[$0]
}
' file file
dawg 2-pass:
awk '
NR==FNR {
cnt[$0]++
next
}
!seen[$0]++ {
print cnt[$0], $0
}
' file file
Comparison: we both store the record counts the same way on the first pass, but on the second pass dawg creates an array seen[] of all of the unique records from the input, so they essentially have two copies of the input stored in memory (which means dawg’s 2-pass script isn’t actually using any less memory than their 1-pass script), whereas my approach keeps only one copy of the input in memory. dawg does a hash lookup plus arithmetic while I do a hash lookup plus a deletion; in terms of execution speed it’s probably not different enough either way to matter. Since we’d only adopt this 2-pass approach to avoid memory issues when the input could be huge (otherwise we’d use the 1-pass approach), my approach is better, as it uses half the memory of both 1-pass scripts and of dawg’s 2-pass script.
A decorate-sort-undecorate approach that doesn’t store a lookup table (so it should work with files that don’t fit in memory; sort uses temp files for large input):
awk '{print NR"\t"$0}' input.csv |
sort -k2 -k1n,1 |
uniq -c -f1 |
sort -k2n,2 |
sed 's/[0-9]*\t//'
- prepend line numbers
- sort ignoring the line number, then reorder identical lines by the line number
- count the duplicates, ignoring the line number
- re-sort to original line order
- strip the line numbers
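Run against the sample data from the question (saved as input.csv), this should produce the counts in the original encounter order, something like the following (uniq -c’s leading whitespace is trimmed here for readability):

1 ASV Genus_Species Taxid
5 cb692dc35226a27b1dc4709405077606 Gymnanthemum amygdalinum 82755
2 46c5e68b0c1230a4305382e0ddf34c5d Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d Marticorenia foliosa 1910960
1 46c5e68b0c1230a4305382e0ddf34c5d Leucheria candidissima 1911693
1 46c5e68b0c1230a4305382e0ddf34c5d Oxyphyllum ulicinum 540062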