I have multiple TSV file with these columns, want to count number of occurences in a column and create a new file with estimated values for each Sample.
Please find below input tables and expected output
Sample | Ind | Start | ID | Status |
---|---|---|---|---|
ID1 | 1 | 1 | ABC_765 | Present |
ID2 | 1 | 2 | ABC_765 | Present |
ID1 | 1 | 3 | DEF_444 | Present |
ID2 | 1 | 5 | DEF_444 | Present |
ID1 | 1 | 5 | HJG_3567 | Present |
ID1 | 1 | 5 | EDC_655 | Absent |
ID2 | 1 | 5 | IUT_897 | Absent |
Need outputs:
- Count value occurences in column 4 Ex:
ID | Count | Status |
---|---|---|
ABC_765 | 2 | Present |
DEF_444 | 2 | Present |
HJG_3567 | 1 | Present |
- Create new tsv files respective to count value:
File 1. ID1_ID2_2.tsv
ID1_ID2 | Count | Status |
---|---|---|
ABC_765 | 2 | Present |
DEF_444 | 2 | Present |
File 2. ID1_1.tsv
|ID1 | Count | Status |
| —— | — | —- |
|EDC_655| 1 | Absent |
File 3 : ID2_1.tsv
ID2 | Count | Status |
---|---|---|
IUT_897 | 1 | Absent |
So far tried:
BEGIN {
FS = "t"
OFS = "t"
{
counts[$4]++
values[$1]
}
END {
print "Value", "Estimated_Count"
for (value in values) {
estimated_count = 0
for (count_value in counts) {
estimated_count += counts[count_value]
}
print value, estimated_count
}
}