The goal of my current project is to load almost 31M rows of dataset into R studio where I want to conduct Exploratory Analysis and conduct research. I cannot use advanced statistical analysis on just SQL server so I have to use R programming language to complete my research.
So far what I have tried and what works is below fetching records using the odbc package:
library(RODBC)
library(DBI)
library(odbc)
conn <- odbcConnect("Server")
sqlQuery(conn,"USE database1")
df <- sqlQuery(conn, "SELECT * FROM table1;")
close(conn)
But if I change the code for Spark to below :
library(RODBC)
library(DBI)
library(odbc)
library(data.table)
library(arrow)
library(sparklyr)
library(dplyr)
library(SparkR)
# Set SPARK_HOME if not already set
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
Sys.setenv(SPARK_HOME = "/home/spark")
}
# Set JAVA_HOME
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jre-1.8")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Path to the JDBC driver (assuming Java 8)
jdbc_driver_path <- "C:/AppData/Local/sqljdbc_12.6/enu/jars/mssql-jdbc-12.6.3.jre8.jar"
# Start Spark session with the JDBC driver added to the classpath
sparkR.session(master = "local[*]",
sparkConfig = list(
spark.driver.memory = "2g",
spark.driver.extraClassPath = jdbc_driver_path
))
# Function to load data using SparkR:
load_data_sparkR <- function(query, server, database) {
# Create a JDBC URL using the server and database name
jdbc_url <- paste0("jdbc:sqlserver://", server, ";databaseName=", database, ";integratedSecurity=true;")
# Read data from the SQL Server using the JDBC URL
df <- read.jdbc(jdbc_url, query,
properties = list(
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
))
return(df)
}
# Server and Database for Customer X:
server <- "Server1"
database <- "DB1"
# Load data from Tables:
df1 <- load_data_sparkR("SELECT * FROM table1;", server, database)
df2 <- load_data_sparkR("SELECT * FROM table2;", server, database)
# Show the first few rows of the data
head(df1)
head(df2)
# Stop the Spark session
sparkR.session.stop()
I get a Java runtime error and that the server driver ‘com.microsoft.sqlserver.jdbc.SQLServerDriver’ is not correct what to do?
I am sure Java is running fine on my machine. Any help would be appreciated.