I need to match complete blogger.googleusercontent.com image URLs that include the /img/a/ subdirectories. The URLs point to images, and the file names don’t have file extensions, but that may not matter.
These are two sample URLs from a large text file dump of HTML. There is a lot of HTML markup, but there are spaces before href and after the closing " of the URLs.
href="https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59M2NTWyWvDlMemRdgT0XMKdjB4NMH02iP4Nb7HbzHwq5ZObxjEC1_oLne6xpUhIkrkpyWEdMX9ck-aU5h1JXdpSw-GhbV90QBEi2xigGLQdoSswWuQtPNNCyWMRJiT2XnEadx170jUDbtL-AQKKzYyarCoj8=s1727"
src="https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59M2NTWyWvDlMemRdgT0XMKdjB4NMH02iP4Nb7HbzHwq5ZObxjEC1_oLne6xpUhIkrkpyWEdMX9ck-aU5h1JXdpSw-GhbV90QBEi2xigGLQdoSswWuQtPNNCyWMRJiT2XnEadx170jUDbtL-AQKKzYyarCoj8=w400-h183"
What I am using is this:
/img/a/[^/]
And that matches
/img/a/A
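To illustrate why the match stops after one character: `[^/]` is a character class with no quantifier, so it consumes exactly one non-slash character. A minimal sketch in Python, run against the first sample URL (the use of Python's `re` module is an assumption; the fiddle may be set to a different flavor):

```python
import re

# First sample line from above, copied verbatim.
sample = (
    'href="https://blogger.googleusercontent.com/img/a/'
    'AVvXsEhb-vB59M2NTWyWvDlMemRdgT0XMKdjB4NMH02iP4Nb7HbzHwq5ZObxjEC1_oLne6'
    'xpUhIkrkpyWEdMX9ck-aU5h1JXdpSw-GhbV90QBEi2xigGLQdoSswWuQtPNNCyWMRJiT2X'
    'nEadx170jUDbtL-AQKKzYyarCoj8=s1727"'
)

# The pattern currently in use: /img/a/ followed by ONE character that is not a slash.
current = re.compile(r"/img/a/[^/]")

match = current.search(sample)
matched_text = match.group(0)
print(matched_text)  # /img/a/A
```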
I don’t really need to match the capital A, but I do need to expand the match to cover the entire URL, from https to the closing ".
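The kind of result I am after could be sketched like this in Python. This is only a sketch under assumptions, not a confirmed solution: it assumes Python's `re` module, that every URL is wrapped in straight double quotes, and that URLs contain no whitespace. The name `URL_RE` and the `/img/b/example` URL are made up for illustration. The match stops just before the closing ", so the quote itself is not captured.

```python
import re

# Assumption: URLs sit inside straight double quotes and contain no whitespace,
# so [^"\s]+ runs up to (but does not include) the closing quote.
URL_RE = re.compile(r'https://blogger\.googleusercontent\.com/img/a/[^"\s]+')

# Shared part of the two sample URLs from above.
BASE = (
    "https://blogger.googleusercontent.com/img/a/"
    "AVvXsEhb-vB59M2NTWyWvDlMemRdgT0XMKdjB4NMH02iP4Nb7HbzHwq5ZObxjEC1_oLne6"
    "xpUhIkrkpyWEdMX9ck-aU5h1JXdpSw-GhbV90QBEi2xigGLQdoSswWuQtPNNCyWMRJiT2X"
    "nEadx170jUDbtL-AQKKzYyarCoj8"
)

html = (
    f' href="{BASE}=s1727" '
    f' src="{BASE}=w400-h183" '
    # Hypothetical URL outside /img/a/, included only to show it is skipped.
    ' src="https://blogger.googleusercontent.com/img/b/example" '
)

urls = URL_RE.findall(html)
for url in urls:
    print(url)
```

If the closing " should be part of the match, appending `"` to the end of the pattern would include it.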
Fiddle: https://regex101.com/r/txLWcO/1