I am working on a project to scrape web pages using Scrapy in Python. I want to extract all of the text from a web page, visible or not. In addition to the text, I would also like to extract email addresses from mailto links.
I’m sure there is a way to use Scrapy to extract all the text, but for now I’m extracting it tag by tag. What I mean is, I have an array of tags ['a', 'p', 'h1', ... 'span'] and I extract text from them by iterating over these tags. This is not what I need, because the text from each tag is appended to the end of the result instead of staying where it appears on the page. I want all the text from the page to keep its original order.
Please help me with this.
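    import re
    from collections import Counter

    import scrapy
    from scrapy.spiders import CrawlSpider


    class Spider(CrawlSpider):
        name = "webScraper"
        # Counts how often each text fragment has been seen.
        text_counter = Counter()

        def parse(self, response):
            extracted_items = []
            # Texts are collected tag by tag, so the page order is lost.
            for tag in ['a', 'p', 'h1', 'span']:
                for element_text in response.css(f"{tag}::text").getall():
                    # Collapse runs of whitespace into single spaces.
                    processed_text = re.sub(r"\s+", " ", element_text).strip()
                    if processed_text:
                        self.text_counter[processed_text] += 1
                        extracted_items.append(processed_text)
            # Scrapy callbacks should produce items (e.g. dicts), not bare strings.
            yield {"url": response.url, "texts": extracted_items}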
It works, but not the way I want it to.
For example this site:
For some reason, I didn’t get anything out of it. I would like to get names, addresses, phone numbers, etc.
What I need:
- A way to extract all visible text from the web page as it appears on the rendered page, maintaining the flow of the content.
- A method to extract email addresses from mailto links on the page.
- Extract phone/first name/last name
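To make the first two points concrete, here is a rough sketch of the kind of result I am after. It is not Scrapy's own API: it uses only Python's standard `html.parser`, and the names `PageTextParser`, `extract_page`, and `SKIPPED_TAGS` are made up for illustration. It assumes "visible" simply means "not inside script/style-like tags"; it does not detect elements hidden with CSS, and it cannot see content that a page builds with JavaScript.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit

# Tags whose contents are not shown as page text (an assumption, not a full list).
SKIPPED_TAGS = {"script", "style", "noscript", "template", "title"}


class PageTextParser(HTMLParser):
    """Collects text in document order plus addresses from mailto: links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []       # text fragments, in the order they occur
        self.emails = []       # unique addresses found in mailto: hrefs
        self._skip_depth = 0   # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = (dict(attrs).get("href") or "").strip()
            if href.lower().startswith("mailto:"):
                # urlsplit drops the scheme and any ?subject=... query part.
                recipients = unquote(urlsplit(href).path)
                for address in recipients.split(","):
                    address = address.strip()
                    if address and address not in self.emails:
                        self.emails.append(address)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace into single spaces.
            text = re.sub(r"\s+", " ", data).strip()
            if text:
                self.chunks.append(text)


def extract_page(html):
    """Return (text fragments in page order, mailto addresses) for an HTML string."""
    parser = PageTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks, parser.emails
```

Inside a Scrapy callback this could presumably be called as `chunks, emails = extract_page(response.text)`, since the text then comes out in one pass over the document rather than grouped by tag.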