There is an array of data:
https://example.com:description of the site/application:category
http://example.com:description of the site/application:category
android://package name:description of the site/application:category
android://package name|description of the site/application|category
I want to split the data into 3 columns:
URL | Description | Category |
---|---|---|
https://example.com | description of the site/application | category |
http://example.com | description of the site/application | category |
android://package name | description of the site/application | category |
android://package name | description of the site/application | category |
As I understand it, I need a regex that ignores the first “:” and also accepts “|” as a second delimiter.
I tried this expression, but the output is incorrect:
cat * | awk -F["|"][:] '{print $1,$2, $3}'
Using any awk:
$ awk -F'[:|]' -v OFS='\t' '{sub(/:/,RS); sub(RS,":",$1)} 1' file
https://example.com description of the site/application category
http://example.com description of the site/application category
android://package name description of the site/application category
android://package name description of the site/application category
or, if the OFS character can’t be present in the URL in the input:
$ awk -F'[:|]' -v OFS='\t' '{$1=$1; sub(OFS,":")} 1' file
https://example.com description of the site/application category
http://example.com description of the site/application category
android://package name description of the site/application category
android://package name description of the site/application category
Set OFS to something other than \t as you see fit.
Please read the POSIX spec to learn what bracket expressions such as the one you used, ["|"][:], and the one I used, [:|], mean.
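To make the difference concrete, here is a minimal sketch that counts the fields awk sees with each separator on one of the sample lines. It assumes the unquoted `-F["|"][:]` reaches awk as `[|][:]` once the shell has removed the double quotes (and that no file name in the current directory happens to match it as a glob); the variable names are only for illustration.

```shell
line='https://example.com:description of the site/application:category'

# '[|][:]' is two bracket expressions in a row: it only matches a "|"
# immediately followed by a ":", which never occurs in this line.
n_two=$(printf '%s\n' "$line" | awk -F'[|][:]' '{ print NF }')

# '[:|]' is a single bracket expression: it matches one ":" or one "|".
n_one=$(printf '%s\n' "$line" | awk -F'[:|]' '{ print NF }')

echo "[|][:] gives $n_two field(s), [:|] gives $n_one field(s)"
```

With the first separator the whole line stays in $1, which would explain why the attempted command did not split anything.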
Having said that, I suspect the OP's real input probably looks something like this (where additional :s or |s can appear in the URL and/or description, but no literal blanks can be in the URL):
$ cat file
https://example.com:description of : the site/application:category
http://example.com:description: of the site/application:category
android://package%20name:description of the site/application:category
android://package%20name|description of the site/application|category
android://package_name:17:something:description of the :huge: site/application:category
and then you can get the output you want using the following sed script (using a sed that has -E to enable EREs, e.g. GNU and BSD seds):
$ sed -E 's/([^ ]+)[:|]([^ ].*)[:|]/\1\t\2\t/' file
https://example.com description of : the site/application category
http://example.com description: of the site/application category
android://package%20name description of the site/application category
android://package%20name description of the site/application category
android://package_name:17:something description of the :huge: site/application category
or using any sed:
$ sed 's/\([^ ]*\)[:|]\([^ ].*\)[:|]/\1\t\2\t/' file
https://example.com description of : the site/application category
http://example.com description: of the site/application category
android://package%20name description of the site/application category
android://package%20name description of the site/application category
android://package_name:17:something description of the :huge: site/application category
Those sed commands assume a description contains at least 1 blank and doesn’t start with a : or word:word; if that’s not the case then there is no way to separate a description from a URL given what we know so far about the input.
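One portability caveat: POSIX does not define \t in the replacement text of a sed s command, so a sed that doesn't support it may print a literal t instead of a tab. A minimal sketch of a workaround, assuming a POSIX shell, is to let printf produce the tab and splice it into the script. The names `tab` and `to_tsv` are hypothetical helpers, not part of the commands above.

```shell
# A literal tab character; command substitution only strips trailing newlines.
tab=$(printf '\t')

to_tsv() {
  # BRE version of the URL/description/category split, with the tab taken
  # from the shell variable rather than from a \t escape.
  sed 's/\([^ ]*\)[:|]\([^ ].*\)[:|]/\1'"$tab"'\2'"$tab"'/' "$@"
}

printf '%s\n' 'http://example.com:description: of the site/application:category' | to_tsv
```

The function takes file names like sed does, so `to_tsv file` should behave like the "any sed" command on the sample input.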
If the format can be one of:
url:description:category
url|description|category
and only url can contain extra : or |, then to change to tab separators with sed:
sed -E 's/(.*)([:|])(.*)\2(.*)/\1\t\3\t\4/' file
or slightly more efficiently:
sed -E 's/(.*)([:|])(.*)\2/\1\t\3\t/' file
The first .* consumes as much as possible, so it also consumes any surplus : and |.
Change \t if tabs are not the desired new separators.
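As an illustration of the greedy match plus backreference, the sketch below runs the shorter substitution on two made-up lines, one of which has extra : and | characters inside the URL. The sample URLs and the `split3` and `tab` names are assumptions for the demo, and the tab is supplied by printf in case the local sed does not interpret \t in the replacement.

```shell
tab=$(printf '\t')

split3() {
  # (.*) grabs the longest possible URL, ([:|]) captures the separator that
  # follows it, and \2 requires that same separator before the category.
  sed -E 's/(.*)([:|])(.*)\2/\1'"$tab"'\3'"$tab"'/' "$@"
}

printf '%s\n' \
  'https://example.com:8080/a|b:description of the site/application:category' \
  'android://package name|description of the site/application|category' | split3
```

In the first line the URL keeps its `:8080` and `|`, because only the last two matching separators are replaced.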
You may notice that you have 4 fields per line when splitting with [:|]:
awk -F '[:|]' '{ print NF }' infile
Output:
4
4
4
4
So, assuming all your lines are formatted in this way, you can get 3 columns by joining fields 1 and 2, e.g.:
awk -F '[:|]' -v OFS='\t' '{ print $1":"$2, $3, $4 }' infile
Output:
https://example.com description of the site/application category
http://example.com description of the site/application category
android://package name description of the site/application category
android://package name description of the site/application category
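Since this relies on every line splitting into exactly 4 fields, a guarded variant may be useful. The sketch below only joins fields 1 and 2 when NF is 4 and reports any other line on stderr instead of printing a misaligned row. The function name `join_url` and the wording of the warning are hypothetical.

```shell
join_url() {
  awk -F '[:|]' -v OFS='\t' '
    # Expected shape: scheme, rest of URL, description, category.
    NF == 4 { print $1 ":" $2, $3, $4; next }
    # Anything else is reported on stderr and left out of the output.
    { printf "line %d: %d fields, skipped\n", NR, NF | "cat 1>&2" }
  ' "$@"
}

printf '%s\n' \
  'http://example.com:description of the site/application:category' \
  'http://example.com:8080:description of the site/application:category' | join_url
```

Here the second line has a port number, so it splits into 5 fields and is skipped rather than shifted by one column.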