Intro and System
I am trying to access a Redshift database from R. I am on an Apple M3 Pro machine on Sonoma 14.5. I am expecting to give the database name (e.g., “dev”), the host URL, the port number, my username, and my password.
I have tried using a few packages, described below.
Attempt 1: RPostgres
{RPostgres}
is the most straightforward. I used this post and the official package documentation to help set up.
First, I installed the necessary system software by running brew install libpq
. I also elsewhere saw that brew install postgresql
may help, too, so I have those installed.
I installed the package using install.packages("RPostgres")
as well as installing it from source using remotes::install_github("r-dbi/RPostgres")
. I then ran:
con <- dbConnect(
RPostgres::Redshift(),
dbname = "dev",
host = Sys.getenv("RS_URLS"),
port = 5439,
user = Sys.getenv("RS_USER"),
password = Sys.getenv("RS_PASS"),
sslmode = "require"
)
Where the environmental variables are the Redshift URL, username, and password for my database. No matter what I tried, I keep getting the error:
Error: connection to server at "<host>" (<ip>), port 5439 failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
This appears to be a known and open issue.
Attempt 2: RPostgreSQL
{RPostgreSQL}
says it will also use libpq
in the Description
to the package on CRAN. Using the syntax from the example in the documentation:
drv <- dbDriver("PostgreSQL")
con <- dbConnect(
drv,
dbname = "dev",
host = Sys.getenv("RS_URLS"),
port = 5439,
user = Sys.getenv("RS_USER"),
password = Sys.getenv("RS_PASS")
)
Throws the error:
Error in postgresqlNewConnection(drv, ...) :
RPosgreSQL error: could not connect <user>@<host>:5439 on dbname "dev": FATAL: no pg_hba.conf entry for host "???", user "<user>", database "dev", SSL off
Note on SSL that I have successfully hooked Redshift up to a database visualizer on my computer, and it required SSL to be on. Candidy, I have no clue what SSL is or means, but there isn’t an option for it here (while there is in {RPostgres}
but it still failed with that flagged).
Attempt 3: odbc
Posit (fka RStudio) officially recommends using ODBC via the {odbc}
package. There is comprehensive documentation:
- Package vignette for setting up
- Setting up a driver
- Amazon guide to downloading their ODBC Redshift driver
- Format for connecting to Redshift
Starting with the vignette, I ran brew install unixodbc
as well as brew install psqlodbc
. However, I don’t think that’s the driver to use since I want to use Redshift. So I followed the instructions on that Amazon guide for downloading, installing, and formatting the .ini
files for the driver.
Per Amazon’s recommendation, I took the odbcinst.ini
file from the Setup
subdirectory from the install and copied it to my home directory, prepended with .
to keep it hidden. When I cat
that file, it reads:
[ODBC Drivers]
Amazon Redshift=Installed
[Amazon Redshift]
Description=Amazon Redshift ODBC Driver
Driver=/opt/amazon/redshift/lib/libamazonredshiftodbc.dylib
The issue is when I run odbcinst -j
in the terminal—per the setup vignette—I do not see the correct paths:
unixODBC 2.3.12
DRIVERS............: /usr/local/etc/odbcinst.ini
SYSTEM DATA SOURCES: /usr/local/etc/odbc.ini
FILE DATA SOURCES..: /usr/local/etc/ODBCDataSources
USER DATA SOURCES..: /Users/<username>/.odbc.ini
SQLULEN Size.......: 8
SQLLEN Size........: 8
SQLSETPOSIROW Size.: 8
And if I cat
the first two .ini
files, they’re empty. The user data source .ini
simply does not exist.
After reading this GitHub issue, I realized I had to change my .Renviron
file to have it look for the right directory. Updated that to: ODBCSYSINI="/Users/<username>"
since Amazon suggested I move the .ini
to my home directory.
So now, into R. If I have it list the drivers, it reads the correct path where I know the driver exists:
> odbc::odbcListDrivers()
name attribute value
1 ODBC Drivers Amazon Redshift Installed
2 Amazon Redshift Description Amazon Redshift ODBC Driver
3 Amazon Redshift Driver /opt/amazon/redshift/lib/libamazonredshiftodbc.dylib
I’ve navigated there repeatedly and know that driver is there. So when I use the suggested Redshift format, I enter:
con <- dbConnect(
odbc(),
Driver = "Amazon Redshift",
servername = Sys.getenv("RS_URLS"),
database = "dev",
UID = Sys.getenv("RS_USER"),
PWD = Sys.getenv("RS_PASS"),
Port = 5439
)
And I receive the error:
Error: nanodbc/nanodbc.cpp:1138: 00000%0D%0A[Amazon][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function: [Amazon][Support] (50483) Could not load shared library, all attempted paths ("") failed
Googling around, I actually ended up at another Posit document—albeit for a different database—that was aimed at troubleshooting for Mac users. It suggests finding where libodbcinst.dylib
is and adding it to the configuration file for the driver. For me, that is at /opt/amazon/redshift/lib/
and it’s called amazon.redshiftodbc.ini
. Per the link above, I added the location to libodbcinst.dylib
and another line at the bottom of it:
ODBCInstLib=/usr/local/Cellar/unixodbc/2.3.12/lib/libodbcinst.dylib
DriverManagerEncoding=UTF-16
There’s also another location for this file elsewhere in /usr/local/lib/
but it points to that Cellar
path:
libodbcinst.dylib -> ../Cellar/unixodbc/2.3.12/lib/libodbcinst.dylib
Regardless of what path I put in the configuration file, I get the same error, but now it adds that path to the error message:
> con <- dbConnect(
+ odbc(),
+ Driver = "Amazon Redshift",
+ servername = Sys.getenv("RS_URLS"),
+ database = "dev",
+ UID = Sys.getenv("RS_USER"),
+ PWD = Sys.getenv("RS_PASS"),
+ Port = 5439
+ )
Error: nanodbc/nanodbc.cpp:1138: 00000
[Amazon][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function: [Amazon][Support] (50483) Could not load shared library, all attempted paths ("", "/usr/local/Cellar/unixodbc/2.3.12/lib/libodbcinst.dylib") failed
>
I spent a lot of time fiddling around with .ini
files and where they point, etc., and can’t get anything beyond that error.
Attempt 4: RJDBC
{RJDBC}
is last, because I dislike depending on Java. However, the only “official” guide from Amazon on hooking up to R uses this package. It has me download an ancient driver, but let’s try.
I run this code, from the example:
install.packages("RJDBC")
library(RJDBC)
# download Amazon Redshift JDBC driver
download.file(
'http://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.9.1009.jar',
'RedshiftJDBC41-1.1.9.1009.jar'
)
# connect to Amazon Redshift
driver <- JDBC(
"com.amazon.redshift.jdbc41.Driver",
"RedshiftJDBC41-1.1.9.1009.jar",
identifier.quote = "`"
)
And R immediately aborts the session. I found another guide for this package from 2017, and I run the suggested code therein… and R once again kills itself and aborts the session when I get to the JDBC()
call reading in the driver.
Summary
How do I connect R to Redshift?
(And why the hell is it this hard, when packages like {bigrquery}
make it trivially simple for other databases?)