I’m trying to remove duplicates from a large CSV file, based on the values in column 1, but taking the following into account:
Column 3 could be empty or have multiple values separated by :::
If more than one record has the same value in column 1, keep the record that has the maximum number of elements in column 3.
Remove the - between numbers in column 3, in case it exists.
My input is:
H1,H2,H3,H4
a,2,8005:::+2287:::3426,2
b,4,1111:::+15-00:::01354,1
b,4,1111:::+1500,1
c,4,2208:::+6583,9
d,5,7761:::+993733:::+53426,4
d,5,7761:::+993-733:::+53-426:::87425,4
d,5,7761:::53-426,4
The output I’m trying to get is:
H1,H2,H3,H4
a,2,8005:::+2287:::3426,2
b,4,1111:::+1500:::01354,1
c,4,2208:::+6583,9
d,5,7761:::+993733:::+53426:::87425,4
My current script only removes duplicates without the other considerations, since I don’t know how to combine both scripts or how to add the
condition to keep the record that has more elements in column 3.
awk -F',' -v OFS=',' '{ gsub(/-/, "", $3); print }' input.csv > input_without_hyphen.csv
awk -F',' -v OFS=',' '!a[$1]++' input_without_hyphen.csv > output.csv
Thanks for any help.
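For reference, the two steps and the "keep the record with the most elements" condition could probably be combined into a single awk pass along the lines below. This is only a sketch under some assumptions: the function name dedupe_csv is made up for illustration, the first line is treated as a header, every hyphen in column 3 is removed (not only those between digits), no field contains a quoted comma, and when two records tie on the element count the first one seen is kept.

```shell
# Hypothetical helper: dedupe_csv FILE
# Prints the header, then one record per distinct column-1 value,
# keeping the record whose column 3 has the most ":::"-separated elements.
dedupe_csv() {
  awk -F',' -v OFS=',' '
    NR == 1 { print; next }                # pass the header through unchanged
    {
      gsub(/-/, "", $3)                    # strip hyphens from column 3
      n = split($3, parts, ":::")          # element count (0 if column 3 is empty)
      if (!($1 in best)) {
        order[++count] = $1                # remember first-seen order of keys
        best[$1] = $0; max[$1] = n
      } else if (n > max[$1]) {            # strictly more elements: replace
        best[$1] = $0; max[$1] = n
      }
    }
    END {
      for (i = 1; i <= count; i++) print best[order[i]]
    }
  ' "$1"
}
```

It would be called as `dedupe_csv input.csv > output.csv`. Setting OFS to a comma matters here: once gsub changes $3, awk rebuilds the record using OFS, which defaults to a space. All kept records are held in memory until the END block, which may be a concern for a very large file with many distinct keys.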