I am trying to scrape a page with jsoup in Java. I want to read the rows of the “stdregistro” table at the URL below and save them to a table in my database:
pad.minem.gob.pe/REINFO_WEB/Index.aspx
However, I only get the rows from the first page of the table and not from the following pages.
Here is my code so far. It retrieves the data, but only from the first page:
public class apiscraping {

    private static final String URL = "https://pad.minem.gob.pe/REINFO_WEB/Index.aspx";
    private static final String JDBC_URL = "jdbc:sqlserver://xxxxx:1433;databaseName=xxxxx;user=xxxx;password=xxxxx";

    public static void main(String[] args) {
        try {
            scrapeAndStore();
        } catch (Exception e) {
            System.out.println(e.getMessage());
            e.printStackTrace();
        }
    }

    private static void scrapeAndStore() throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        int pageNumber = 1;
        boolean hasNextPage = true;
        while (hasNextPage) {
            Document doc = Jsoup.connect(URL).get();
            Elements rows = doc.select("#stdregistro tr");
            for (Element row : rows) {
                Elements columns = row.select("td");
                if (columns.size() > 0) {
                    String ruc = columns.get(2).text();
                    String mineroFormal = columns.get(3).text();
                    String codigoUnico = columns.get(4).text();
                    String nombre = columns.get(5).text();
                    String departamento = columns.get(6).text();
                    String provincia = columns.get(7).text();
                    String distrito = columns.get(8).text();
                    String estado = columns.get(9).text();
                    saveToDatabase(ruc, mineroFormal, codigoUnico, nombre, departamento, provincia, distrito, estado);
                }
            }
            hasNextPage = navigateToNextPage(httpClient, doc);
            System.out.println("Processed page: " + pageNumber);
            pageNumber++;
        }
        httpClient.close();
    }

    private static boolean navigateToNextPage(CloseableHttpClient httpClient, Document doc) throws IOException {
        Element nextButton = doc.select("#ImgBtnSiguiente").first();
        if (nextButton == null || nextButton.hasAttr("disabled")) {
            return false;
        }
        String viewState = doc.select("input[name=__VIEWSTATE]").val();
        String eventValidation = doc.select("input[name=__EVENTVALIDATION]").val();
        String viewStateGenerator = doc.select("input[name=__VIEWSTATEGENERATOR]").val();
        String txtpagina = doc.select("input[name=txtpagina]").val();
        int nextPage = Integer.parseInt(txtpagina) + 1;
        HttpPost post = new HttpPost(URL);
        List<BasicNameValuePair> params = new ArrayList<>();
        params.add(new BasicNameValuePair("__VIEWSTATE", viewState));
        params.add(new BasicNameValuePair("__EVENTVALIDATION", eventValidation));
        params.add(new BasicNameValuePair("__VIEWSTATEGENERATOR", viewStateGenerator));
        params.add(new BasicNameValuePair("txtpagina", String.valueOf(nextPage)));
        params.add(new BasicNameValuePair("ImgBtnSiguiente.x", "1"));
        params.add(new BasicNameValuePair("ImgBtnSiguiente.y", "1"));
        post.setEntity(new UrlEncodedFormEntity(params));
        CloseableHttpResponse response = httpClient.execute(post);
        Document nextDoc = Jsoup.parse(response.getEntity().getContent(), "UTF-8", URL);
        response.close();
        return nextDoc != null;
    }

    private static void saveToDatabase(String ruc, String mineroFormal, String codigoUnico, String nombre, String departamento, String provincia, String distrito, String estado) {
        String sql = "INSERT INTO OPE_REINFO (OPE_RUC, OPE_MINEROFORMAL, OPE_CODIGOUNICO, OPE_NOMBRE, OPE_DEPARTAMENTO, OPE_PROVINCIA, OPE_DISTRITO, OPE_ESTADO) VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(JDBC_URL);
             PreparedStatement pstmt = conn.prepareStatement(sql)) {
            pstmt.setString(1, ruc);
            pstmt.setString(2, mineroFormal);
            pstmt.setString(3, codigoUnico);
            pstmt.setString(4, nombre);
            pstmt.setString(5, departamento);
            pstmt.setString(6, provincia);
            pstmt.setString(7, distrito);
            pstmt.setString(8, estado);
            pstmt.executeUpdate();
            System.out.println("Inserted record: RUC = " + ruc + ", Minero = " + mineroFormal + ", Código Único = " + codigoUnico);
        } catch (SQLException e) {
            System.out.println("Error inserting record: " + e.getMessage());
        }
    }
}
I think the problem lies in how the data is obtained from that table. I am new to scraping and still experimenting, so any suggestions or advice are welcome. Thanks.
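Reading my own loop again, I notice that every iteration calls `Jsoup.connect(URL).get()`, which fetches the first page again, and that `navigateToNextPage` parses the POST response into `nextDoc` but only returns a boolean, so that document is never read. I am not sure this is the whole story, but below is a minimal sketch of the loop shape I suspect I need: each iteration continues from the page returned by the previous POST and takes the hidden fields from that page. To keep it runnable without the real site, the HTTP POST is replaced by a fake three-page server. `Page`, `nextPageForm`, `collectAllRows`, `fakePost` and the `vs1`/`vs2`/`vs3` values are hypothetical names I made up for illustration; only the field names `__VIEWSTATE` and `ImgBtnSiguiente.x`/`.y` come from my code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PostbackPagingSketch {

    /** Hypothetical holder for one parsed page: its table rows plus the hidden fields of its form. */
    static final class Page {
        final List<String> rows;
        final Map<String, String> hiddenFields; // e.g. __VIEWSTATE, __EVENTVALIDATION
        final boolean hasNext;                  // false when the "next" button is missing or disabled

        Page(List<String> rows, Map<String, String> hiddenFields, boolean hasNext) {
            this.rows = rows;
            this.hiddenFields = hiddenFields;
            this.hasNext = hasNext;
        }
    }

    /** Builds the "next page" postback form from the CURRENT page's hidden fields. */
    static Map<String, String> nextPageForm(Page current) {
        Map<String, String> form = new LinkedHashMap<>(current.hiddenFields);
        form.put("ImgBtnSiguiente.x", "1"); // click coordinates sent by an image button
        form.put("ImgBtnSiguiente.y", "1");
        return form;
    }

    /** Walks all pages; each iteration continues from the page returned by the previous POST. */
    static List<String> collectAllRows(Page first, Function<Map<String, String>, Page> post) {
        List<String> all = new ArrayList<>();
        Page current = first;
        while (true) {
            all.addAll(current.rows);
            if (!current.hasNext) {
                break;
            }
            // Continue from the response of the POST instead of re-fetching the first page.
            current = post.apply(nextPageForm(current));
        }
        return all;
    }

    /** Fake three-page server keyed on __VIEWSTATE, standing in for the real HTTP POST (assumption). */
    static Page fakePost(Map<String, String> form) {
        if (!form.containsKey("ImgBtnSiguiente.x")) {
            throw new IllegalArgumentException("missing image button coordinates");
        }
        String viewState = form.get("__VIEWSTATE");
        if ("vs1".equals(viewState)) {
            return new Page(Arrays.asList("row3", "row4"),
                    Collections.singletonMap("__VIEWSTATE", "vs2"), true);
        }
        if ("vs2".equals(viewState)) {
            return new Page(Arrays.asList("row5"),
                    Collections.singletonMap("__VIEWSTATE", "vs3"), false);
        }
        throw new IllegalArgumentException("stale or unknown __VIEWSTATE: " + viewState);
    }

    /** Runs the loop against the fake server, starting from a hypothetical first page. */
    static List<String> demo() {
        Page first = new Page(Arrays.asList("row1", "row2"),
                Collections.singletonMap("__VIEWSTATE", "vs1"), true);
        return collectAllRows(first, PostbackPagingSketch::fakePost);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In my real code, I assume `post` would be the HTTP POST plus parsing the response with jsoup, and the first page would come from a single initial GET made outside the loop. I also suspect the GET and the POSTs need to share the same session cookies, which is not the case when jsoup does the GET and `httpClient` does the POST, but I have not confirmed that this site requires it.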