I’m trying to filter lines in a CSV file based on two specific conditions, using awk:
- Field 2 should be null (empty) and Field 4 should be non-null (not empty).
- Field 2 should be non-null (not empty) and Field 4 should be null (empty).
For example, consider a CSV file with the following content:
"venBio","http://www.venbio.com","","venBio is a Investor located in United States, North America."
"zhenZhou","http://www.zhenzhou.com","",""
"Loren","","","A famous Hollywood actress from the 1950s and 1960s"
I want the awk script to first print:
"Loren","","","A famous Hollywood actress from the 1950s and 1960s"
followed by
"zhenZhou","http://www.zhenzhou.com","",""
I’ve tried various approaches, but I can’t seem to get the filtering right for these conditions. Could someone guide me on the proper awk syntax to achieve this?
Approaches tried:
awk -F, '($2 == "" && $4 != "")' input.csv
awk -F, '($2 != "" && $4 == "")' input.csv
awk -F, '($2 ~ /^[[:space:]]*$/ && $4 !~ /^[[:space:]]*$/)' input.csv
awk -F, '($2 !~ /^[[:space:]]*$/ && $4 ~ /^[[:space:]]*$/)' input.csv
You may use this awk:
awk -F, -v nul='""' '$2 != nul && $4 == nul {s = s $0 ORS}
$2 == nul && $4 != nul; END {printf "%s", s}' file
"Loren","","","A famous Hollywood actress from the 1950s and 1960s"
"zhenZhou","http://www.zhenzhou.com","",""
The One True Awk supports CSV directly but XOR still has to be implemented as NOT/AND/OR:
awk --csv '(!length($2)&&length($4)) || (length($2)&&!length($4))' input.csv
giving:
"zhenZhou","http://www.zhenzhou.com","",""
"Loren","","","A famous Hollywood actress from the 1950s and 1960s"
If you really want the output in the order from the question, you can run the command twice:
awk --csv '!length($2)&&length($4)' input.csv
awk --csv 'length($2)&&!length($4)' input.csv
giving:
"Loren","","","A famous Hollywood actress from the 1950s and 1960s"
"zhenZhou","http://www.zhenzhou.com","",""
Assumptions:
- all fields are wrapped in double quotes
- no fields include embedded/escaped double quotes
If we use -F'"' to designate the input field delimiter then we just need to remember that all data fields are actually even-numbered (ie, 1st field is referenced as $2, 2nd field is referenced as $4).
Modifying and combining OP’s 1st two awk scripts:
awk -F'"' '
FNR==NR { if ($4 == "" && $8 != "") print; next } # process 1st file
{ if ($4 != "" && $8 == "") print } # process 2nd file
' input.csv input.csv
NOTES:
- $4 is the 2nd data field; $8 is the 4th data field
- print with no args is treated the same as print $0 (ie, print the entire line as is)
- we provide two references of the input file (input.csv) in order to limit memory usage while generating the output in the desired order
This generates:
"Loren","","","A famous Hollywood actress from the 1950s and 1960s"
"zhenZhou","http://www.zhenzhou.com","",""
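To see why the data fields end up even-numbered with -F'"', here is a minimal sketch that prints fields 2, 4, 6 and 8 of one sample row, separated by `|`. The variable names `row` and `fields` are only for illustration, and the same assumptions apply (every field quoted, no embedded quotes).

```shell
# One row copied from the question's sample data
row='"zhenZhou","http://www.zhenzhou.com","",""'

# Splitting on the double quote puts the data in $2, $4, $6, $8;
# the odd-numbered fields hold the commas (or nothing at the line ends)
fields=$(printf '%s\n' "$row" | awk -F'"' '{ printf "%s|%s|%s|%s", $2, $4, $6, $8 }')

echo "$fields"
```

The two trailing `|` separators with nothing between or after them correspond to the empty 3rd and 4th data fields.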
The problem with the commands you tried is that awk does not strip the double quotes around each field, so with -F, an empty field is the two-character string "" rather than an empty string. You can use the separator argument to solve that:
awk -F'","' '($2 == "" && $4 != "\"") || ($2 != "" && $4 == "\"")' file.csv
Because field $4 is the last one, it still carries the closing double quote, so an empty 4th field has to be compared against "\"" instead of "".
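As a quick check of that explanation, the sketch below measures the length of field 2 for the "Loren" row under both separators. The variable names `row`, `comma_len` and `sep_len` are hypothetical, chosen only for this illustration.

```shell
# One row copied from the question's sample data
row='"Loren","","","A famous Hollywood actress from the 1950s and 1960s"'

# With -F, the quotes remain part of the field, so the "empty" field 2 is two characters long
comma_len=$(printf '%s\n' "$row" | awk -F, '{ print length($2) }')

# With -F'","' the interior quotes belong to the separator, so field 2 is truly empty
sep_len=$(printf '%s\n' "$row" | awk -F'","' '{ print length($2) }')

echo "comma separator: $comma_len, quote-comma-quote separator: $sep_len"
```

This is why a test like `$2 == ""` never matches when the separator is a plain comma.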
If you want to use a true csv parser (other than on GNU awk) you can use Ruby:
ruby -r csv -ne 'BEGIN{a=[]}
li=CSV.parse_line($_); a.unshift($_) if li[1].empty? ^ li[3].empty?
END{puts a.join }' f.csv
Prints:
"Loren","","","A famous Hollywood actress from the 1950s and 1960s"
"zhenZhou","http://www.zhenzhou.com","",""
IIUC, for the latest GNU awk 5.3.0:
awk --csv '($2=="") != ($4=="")' file
#"zhenZhou","http://www.zhenzhou.com","",""
#"Loren","","","A famous Hollywood actress from the 1950s and 1960s"
Note: this one only works when order does not matter.
To get the desired order, another two-pass solution is:
awk --csv '$(2+n) == "" && $(4-n) != ""' file n=2 file
#"Loren","","","A famous Hollywood actress from the 1950s and 1960s"
#"zhenZhou","http://www.zhenzhou.com","",""
So with the first pass, n="" and we test against ($(2+0), $(4-0)), i.e. ($2, $4); for the second pass, n=2 and we test ($(2+2), $(4-2)), which is ($4, $2).
This can also be written as $(lshift(2,n)) == "" && $(rshift(4,n)) != "" and use n=1 for the 2nd pass.
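The index swap can be checked in isolation with any POSIX awk. This is a minimal sketch on a made-up space-separated line; it sets n with -v rather than between file arguments, purely to keep the example short, and the names `first` and `second` are illustrative.

```shell
# n is unset here, so awk treats it as 0 in arithmetic: fields 2 and 4 in that order
first=$(printf 'a b c d\n' | awk '{ print $(2+n), $(4-n) }')

# n=2 swaps the two indices: fields 4 and 2
second=$(printf 'a b c d\n' | awk -v n=2 '{ print $(2+n), $(4-n) }')

echo "first pass: $first"
echo "second pass: $second"
```

An unset awk variable evaluates to 0 in a numeric context, which is what lets the same expression serve both passes.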