The sample code and the idea are as follows:
while read url; do
    # fetch the page, keep only href="…" targets that look like issue links, then strip the href="…" wrapper
    wget -q "$url" -O - | grep -o -E 'href="([^"#]+)"' | grep "magazine/" | grep "https" | sort -u | sed -r 's/.*href="([^"]+).*/\1/g' >> list1
    # drop duplicate links before the next iteration reads them
    perl -ne 'print unless $dup{$_}++;' list1 > list
done < list
The initial line of the list is https://abc.xyz/issues/. Starting from there, wget should find the one link on each page that points to the previous issue, which always has the exact form https://abc.xyz/issues/yyyy/mm/dd (grep filters the candidate links, sort -u removes duplicates, and sed extracts the bare URL). That link is appended to the "list", and the "while read" loop then uses it to fetch the next page and find the link before that, and so on. The perl line is meant to remove duplicates from the list whenever new links are added, before the next link is processed in the loop.
So this is the idea, and the ideal result is a list of hundreds of URLs covering all past issues. I would appreciate some suggestions or, better, a simple solution (I have only very basic knowledge of shell commands).
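
For illustration, here is a rough sketch of the same idea; abc.xyz and the /issues/yyyy/mm/dd layout are placeholders as above, and it assumes each issue page contains at most one link matching that pattern. The loop stops when no new link is found or the link is already on the list.

url="https://abc.xyz/issues/"      # placeholder starting page, as above
> list                             # start with an empty result file
while [ -n "$url" ]; do
    echo "$url" >> list
    # extract the first previous-issue link that is not already on the list
    url=$(wget -q -O - "$url" \
        | grep -o -E 'href="https://abc\.xyz/issues/[0-9]{4}/[0-9]{2}/[0-9]{2}[^"]*"' \
        | sed -r 's/^href="([^"]+)"$/\1/' \
        | grep -v -x -F -f list \
        | head -n 1)
done

This keeps the result file separate from the loop's control variable, which avoids reading from "list" while the loop is still rewriting it.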