I’m working on a project that pulls articles from 20+ RSS feeds, and the variety of places the different feed formats put an article’s image is driving me bonkers.
I’m happy for an article’s image to be missing in some cases, and I’ll fall back to a default image, but I don’t want that happening more than 20% of the time.
Using Python and feedparser (in a Django project), I’ve gotten this far:
    # Try and get the image url
    try:
        # Looking for images embedded in enclosures
        post_image = sanitize_url(str(c.enclosures[0].href))
    except:
        try:
            # Looking for images embedded in media content
            post_image = sanitize_url(str(c.media_content[0]["url"]))
        except:
            try:
                # Looking for images embedded in content
                post_image = sanitize_url(c.content[0]["value"])
            except:
                post_image = "no image found"
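For comparison, the same fallback order could be written without the nesting. This is only a minimal sketch: `first_image_url`, its `sanitize` parameter, and the `default` argument are hypothetical names, and it assumes, like the code above, that a missing field or a rejected value shows up as an exception.

```python
def first_image_url(entry, sanitize, default="no image found"):
    """Try each known image location in order; return the first that works."""
    # Same three locations as the nested version, in the same order.
    extractors = [
        lambda e: str(e.enclosures[0].href),
        lambda e: str(e.media_content[0]["url"]),
        lambda e: e.content[0]["value"],
    ]
    for extract in extractors:
        try:
            return sanitize(extract(entry))
        except Exception:
            # Field missing, list empty, or sanitize rejected the value:
            # move on to the next location.
            continue
    return default
```

With this shape, supporting another feed layout would mean adding one line to `extractors` rather than another level of try/except, e.g. `post_image = first_image_url(c, sanitize_url)`.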
I had to build a sanitize_url function that does 3 things:
Extracts a URL from a string (for situations where the code above pulls in more than just the URL).
Trims the URL down to remove parameters that might cause problems, e.g. 342.jpg?w=25.
Does a final check of the string to make sure it has an image mimetype, in case the final URL is not an image.
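Roughly, those three steps look like the sketch below. It is a simplified stand-in rather than the exact function: the `URL_RE` pattern is a hypothetical choice, the mimetype is only guessed from the file extension via the standard `mimetypes` module, and raising `ValueError` on failure is an assumption made so the try/except chain above falls through to the next location.

```python
import mimetypes
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical pattern: the first http(s) URL in a blob of text or HTML.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def sanitize_url(raw):
    """Return a cleaned image URL found in `raw`; raise ValueError otherwise."""
    # 1. Extract a URL from the string (it may be a whole HTML fragment).
    match = URL_RE.search(raw)
    if match is None:
        raise ValueError("no URL found")
    # 2. Drop the query string and fragment, e.g. 342.jpg?w=25 -> 342.jpg.
    parts = urlsplit(match.group(0))
    url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    # 3. Guess the mimetype from the file extension and require an image.
    mime, _ = mimetypes.guess_type(url)
    if mime is None or not mime.startswith("image/"):
        raise ValueError("not an image URL: " + url)
    return url
```

One limitation of checking by extension: image URLs with no file extension (common on some CDNs) get rejected even though they are valid images.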
I’m going in circles here: for every RSS feed this works on, there are other RSS feeds it doesn’t work on. The variety of RSS structures is driving me a bit batty.
Any advice on what I should do here? I can’t keep adding more and more functions and loops to it.