I’m trying to adapt the following code to scrape links from various pages, e.g. if one page has 40 links, then for 10 pages I expect to get 400 links.
The webpages follow this pattern:
https://www.examples.com/user/username#page1-videos (the “1” in “page1” is the varying element).
The links on the webpage follow this pattern:
https://www.example.com/video/1423061
I have a few questions:
- The original code refers to “name_list” and “link_list”. I don’t need the “name” column in the final csv, just one column (i.e. the urls). I tried simply deleting everything involving name_list, but the df ends up empty. How do I correct this?
- I want to put all the urls to scrape in a .txt file, and have the code iterate through each line in the txt. How do I do that?
import requests
from bs4 import BeautifulSoup
import pandas as pd

i = 0
name_list = []
link_list = []
while i <= 13780:
    # the ?start= offset advances by 20 per results page
    print("https://jito.org/members?start={}".format(i))
    res = requests.get("https://jito.org/members?start={}".format(i))
    soup = BeautifulSoup(res.text, "html.parser")
    # grab the name text and the href of each member entry
    for item in soup.select('.name>a'):
        name_list.append(item.text)
        link_list.append("https://jito.org" + item['href'])
    i = i + 20

print(name_list)
print(link_list)
df = pd.DataFrame({"Name": name_list, "Link": link_list})
print(df)
df.to_csv('JITO_Directory.csv', index=False)
print('Done')
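For the first question, this is roughly the single-column version I was aiming for: the same jito.org pages, everything involving name_list removed, and a one-column DataFrame (the "url" column name and the links.csv filename are just placeholders I picked):

import requests
from bs4 import BeautifulSoup
import pandas as pd

i = 0
link_list = []
while i <= 13780:
    res = requests.get("https://jito.org/members?start={}".format(i))
    soup = BeautifulSoup(res.text, "html.parser")
    # keep only the hrefs; the name text is skipped entirely
    for item in soup.select('.name>a'):
        link_list.append("https://jito.org" + item['href'])
    i = i + 20

# single url column instead of Name/Link
df = pd.DataFrame({"url": link_list})
df.to_csv('links.csv', index=False)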
Each webpage I’m scraping has 40 links, so if I scrape 2 pages:
https://www.example.com/user/person#page2
https://www.example.com/user/person3#page7
I want a txt or csv file with the 80 links scraped from the 2 pages:
https://www.example.com/video/1423061
https://www.example.com/video/1417634
https://www.example.com/video/1417639
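For the second question, this is the kind of loop I have in mind, assuming a urls.txt with one page URL per line; the a[href*="/video/"] selector is only a guess at how to match the /video/... links, since I haven't confirmed the page's actual markup:

import requests
from bs4 import BeautifulSoup

# urls.txt is assumed to hold one page URL per line
with open('urls.txt') as f:
    pages = [line.strip() for line in f if line.strip()]

video_links = []
for page in pages:
    res = requests.get(page)
    soup = BeautifulSoup(res.text, "html.parser")
    # guessed selector: any anchor whose href contains /video/
    for a in soup.select('a[href*="/video/"]'):
        video_links.append(a['href'])

# write one link per line
with open('video_links.txt', 'w') as out:
    out.write("\n".join(video_links))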