Hi, I am using Craigslist for a web scraping assignment. Here is the link:
https://sacramento.craigslist.org/search/apa#search=1~gallery~0~4
What I have to do is build a table that contains:
• the rent amount
• number of bedrooms
• number of bathrooms
• type of housing
• square footage
• address or location information
• whether they allow pets
• amenities such as
  – laundry
  – parking: garage (attached, detached, carport), street, off-street
  – gym
  – internet
  – furnished
  – pool
  – EV charging
  – air conditioning
  – . . .
• length of lease
• security deposit
• whether utilities are included
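To make the target concrete, here is a rough sketch of the table I am aiming for, written as an empty data frame. The column names and types are my own placeholders (nothing in the assignment fixes them), and the amenities are collapsed into a few example columns.

```r
# Hypothetical schema for the final table: one row per listing.
# Column names and types are placeholders chosen for illustration.
listing_schema <- data.frame(
  rent               = numeric(),    # monthly rent in dollars
  bedrooms           = numeric(),
  bathrooms          = numeric(),
  housing_type       = character(),  # e.g. apartment, house, condo
  sqft               = numeric(),
  location           = character(),
  pets_allowed       = logical(),
  laundry            = character(),
  parking            = character(),  # garage, carport, street, off-street, ...
  lease_length       = character(),
  security_deposit   = numeric(),
  utilities_included = logical(),
  stringsAsFactors   = FALSE
)

str(listing_schema)
```

Rows scraped from each page could then be appended with `rbind()`, provided they use the same column names.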
Here is my test code:
library(rvest)
library(xml2) # Load xml2 for handling XML and HTML content
# Function to extract data from a single page
scrape_page_info <- function(link) {
  page <- read_html(link)
  # Extract prices, titles, locations, and sqfts using CSS selectors
  prices <- page %>% html_elements(".price") %>% html_text()
  titles <- page %>% html_elements(".title") %>% html_text()
  locations <- page %>% html_elements(".location") %>% html_text()
  sqfts <- page %>% html_nodes(xpath='//*[@id="search-results-page-1"]/ol/li[1]/div/div[2]/span[2]/span[2]') %>% html_text()
  sqfts <- gsub("[^0-9]", "", sqfts)
  # Handling sqft to ensure alignment
  if (length(sqfts) < length(prices)) {
    sqfts <- c(sqfts, rep(NA, length(prices) - length(sqfts)))
  }
  # Ensure only complete entries are included
  min_length <- min(length(prices), length(titles), length(locations), length(sqfts))
  data <- data.frame(
    Title = titles[1:min_length],
    Price = prices[1:min_length],
    Location = locations[1:min_length],
    SqFt = sqfts[1:min_length]
  )
  return(data)
}
# Example usage; swap in your own search URLs if these do not work
link_davis <- "https://sacramento.craigslist.org/search/apa?search_distance=10"
link_SF <- "https://sfbay.craigslist.org/search/apa?search_distance=6"
all_Davisdata <- data.frame()
all_SFdata <- data.frame()
# Loop through the first 5 pages
for (i in 0:4) {
  start_param <- i * 120 # Each page shows 120 items
  page_url_davis <- paste0(link_davis, "&s=", start_param)
  page_data_davis <- scrape_page_info(page_url_davis)
  all_Davisdata <- rbind(all_Davisdata, page_data_davis) # Accumulate data
  Sys.sleep(2) # Pause to be polite to the server
  page_url_SF <- paste0(link_SF, "&s=", start_param)
  page_data_SF <- scrape_page_info(page_url_SF)
  all_SFdata <- rbind(all_SFdata, page_data_SF)
  Sys.sleep(2)
}
# Print and analyze the first 5 pages of data from Davis and SF
print(all_Davisdata)
print(all_SFdata)
I know that the information inside each individual post requires RSelenium, but the square footage should be reachable from the search results page with a CSS selector. Whenever I try, it always returns NA. I have tried CSS selectors and XPath and I cannot figure out what is wrong.
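To narrow it down, here is a minimal sketch on made-up HTML rather than the live site. The markup and the class names `result` and `size` are hypothetical and are not Craigslist's real structure. It suggests two things that might be going wrong: an XPath containing `li[1]` can only match inside the first listing, and stripping every non-digit from text like "850ft2" glues the trailing 2 onto the number. Selecting each listing first and then calling `html_element()` (singular) inside it seems to keep the columns aligned, with NA where a listing has no size.

```r
library(rvest)

# Made-up stand-in for a results page; class names are hypothetical.
toy_html <- '
<ol>
  <li class="result"><span class="price">$1,500</span><span class="size">850ft<sup>2</sup></span></li>
  <li class="result"><span class="price">$2,100</span></li>
  <li class="result"><span class="price">$1,800</span><span class="size">1200ft<sup>2</sup></span></li>
</ol>'
page <- read_html(toy_html)

# An XPath with li[1] only looks inside the first listing,
# so it yields a single value no matter how many listings exist.
first_only <- page %>%
  html_elements(xpath = "//ol/li[1]/span[2]") %>%
  html_text()

# Removing all non-digits also keeps the "2" from "ft2".
naive_sqft <- gsub("[^0-9]", "", first_only)

# Select every listing, then look inside each one. html_element()
# returns one result per listing and NA when the node is missing.
listings  <- page %>% html_elements("li.result")
price_txt <- listings %>% html_element(".price") %>% html_text()
size_txt  <- listings %>% html_element(".size") %>% html_text()

# Keep only the digits that come before "ft".
sqft  <- as.numeric(gsub(",", "", sub("^\\s*([0-9,]+)\\s*ft.*$", "\\1", size_txt)))
price <- as.numeric(gsub("[^0-9]", "", price_txt))

data.frame(Price = price, SqFt = sqft)
```

This does not explain everything: if the real search page builds its listings with JavaScript, `read_html()` may never see the node at all, which would also produce an empty result that my padding step turns into NA. I have not confirmed whether that is the case here.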