I have lots of line fragments like this:
MYLINEBREAK01\r \r SURNAME, Name (LT)\r \r\nMYBREAK01
It comes from processing a large HTML file with rvest::html_text2(). Long story short: processing the file node by node with the xml2 parser is unwieldy because it takes too much time. Once the HTML is stripped, the text has certain regularities that can be exploited. For example, I have already inserted the placeholders MYBREAK01 and MYLINEBREAK01. I get a bit over my head when trying to get rid of the unneeded \r and \n (carriage returns and linefeeds that may be interspersed with spaces, or at least they appear to be spaces).
I tried to put a %>% gsub() step in the processing chain to get rid of these characters, but I have problems matching and I do not quite know what I am doing wrong:
gsub("(MYLINEBREAK01)(\r|\r\n| |\n)+([a-zA-Z ()]+)(\r|\r\n| \n)+(MYBREAK01)","\\1\\3\\5",.)
This does not appear to match what I want: the string fragment stays unchanged. Also, the (LT) part does not always appear in the field. My aim is to get the string MYLINEBREAK01SURNAME, Name (LT)MYBREAK01, of course without (LT) if it is not there.
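To make the target concrete, here is a minimal sketch of the transformation I am after. It rests on several assumptions: the fragments contain real carriage-return and linefeed characters (not a literal backslash followed by a letter), the gaps contain only ordinary whitespace, and the name itself never spans more than one line. The helper name strip_breaks is made up for illustration, and the pattern is just one possible approach that I have not checked against the full file.

```r
# Hypothetical helper (not from any package): removes whitespace, including
# \r and \n, that sits directly inside a MYLINEBREAK01 ... MYBREAK01 pair.
strip_breaks <- function(x) {
  gsub(
    # \\s* absorbs any mix of spaces, \r and \n next to each placeholder;
    # [^\r\n]*? lazily captures the name, so commas and an optional (LT)
    # are kept without being listed explicitly.
    "(MYLINEBREAK01)\\s*([^\r\n]*?)\\s*(MYBREAK01)",
    "\\1\\2\\3",
    x,
    perl = TRUE  # PCRE, so that \s and the lazy quantifier behave as intended
  )
}

# Sample fragments, assuming real CR/LF characters in the text
with_lt    <- "MYLINEBREAK01\r \r SURNAME, Name (LT)\r \r\nMYBREAK01"
without_lt <- "MYLINEBREAK01\r \r SURNAME, Name\r \r\nMYBREAK01"

strip_breaks(with_lt)     # "MYLINEBREAK01SURNAME, Name (LT)MYBREAK01"
strip_breaks(without_lt)  # "MYLINEBREAK01SURNAME, NameMYBREAK01"
```

In the processing chain this would be used as %>% strip_breaks() in place of the %>% gsub() step.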
Many thanks!