I am trying to connect to a table on Databricks from my local machine, using PyCharm as my Python development environment. The table lives at a specific path in the Databricks filesystem (DBFS).
I have set up a local configuration file to establish the connection, specifying the host, cluster_id, warehouse_id, port, token, and API version. The token and IDs are sensitive, so they are anonymized below.
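In case the file layout matters, it is essentially a .databrickscfg-style profile. A rough sketch with every value anonymized (the exact key names are my best guess; treat them as illustrative):

[DEFAULT]
host = https://<databricks-instance>
token = <access-token>
cluster_id = <cluster-id>
warehouse_id = <warehouse-id>
port = <port>
api_version = <api-version>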
My initial approach uses the Databricks SDK to access the workspace and interact with the clusters and the filesystem. Here is a simplified, anonymized version of the code:
from databricks.sdk import WorkspaceClient
from databricks.connect import DatabricksSession

# Workspace client initialization with anonymized credentials
w = WorkspaceClient(
    host="https://<databricks-instance>",
    token="<access-token>",
)

# Listing cluster names
for c in w.clusters.list():
    print(c.cluster_name)

# Listing files in a directory
d = w.dbutils.fs.ls('')
for f in d:
    print(f.path)
This code runs fine and returns information about the clusters and files, so I can clearly reach the workspace and the cluster.
However, when I try to read a CSV file from the Databricks filesystem with the following code:
# Create a Spark Connect session from the local Databricks configuration
spark = DatabricksSession.builder.getOrCreate()

# Read a CSV file from DBFS into a Spark DataFrame, then convert to pandas
file_path = "dbfs:/path/to/your/csvfile.csv"
df_spark = spark.read.format("csv").option("header", "true").load(file_path)
df = df_spark.toPandas()
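For completeness, the session can also be built with explicit connection details instead of relying on the config file; a minimal sketch, assuming the remote() builder from databricks-connect (host, token, and cluster ID are placeholders):

from databricks.connect import DatabricksSession

# Sketch: build the session with explicit parameters rather than
# the local config file; all values below are placeholders.
spark = (
    DatabricksSession.builder
    .remote(
        host="https://<databricks-instance>",
        token="<access-token>",
        cluster_id="<cluster-id>",
    )
    .getOrCreate()
)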
I receive an error related to the cluster configuration:
pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "BAD_REQUEST: SingleClusterComputeMode(<cluster-id>) is not **Shared** or **Single User Cluster**. (requestId=<request-id>)"
It seems the problem is the cluster's access mode: the error says the cluster must be a Shared or Single User cluster, and mine apparently is neither. Alternative attempts, such as connecting via JDBC, fail with the same error.
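To check which mode the cluster is actually in, the SDK can report it; a minimal sketch, assuming the data_security_mode field on the cluster details (the cluster ID is a placeholder; as far as I know, SINGLE_USER corresponds to Single User and USER_ISOLATION to Shared):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient(
    host="https://<databricks-instance>",
    token="<access-token>",
)

# Inspect the cluster's access mode; per the error above, Databricks
# Connect expects SINGLE_USER or USER_ISOLATION (Shared).
cluster = w.clusters.get(cluster_id="<cluster-id>")
print(cluster.data_security_mode)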
I also attempted to connect directly to the SQL Warehouse with the following code:
import pandas as pd
import pyodbc
# Connection details
server = 'https://<my_instance>.azuredatabricks.net'
database = '<my_database>'
username = '<my_email>'
password = '<my_password>'
driver = '{ODBC Driver 17 for SQL Server}'
port = 1433
# Connection string
connection_string = f'DRIVER={driver};SERVER={server};PORT={port};UID={username};PWD={password};DATABASE={database}'
# Establishing the connection
connection = pyodbc.connect(connection_string)
# Query example
query = 'SELECT * FROM "<my_table>" LIMIT 10'
df = pd.read_sql(query, connection)
# Closing the connection
connection.close()
print(df)
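I am also open to replacing pyodbc with the databricks-sql-connector package to query the SQL Warehouse directly. A rough sketch of what I would try (assuming that package; the server hostname, HTTP path, and token are placeholders taken from the warehouse's connection details):

import pandas as pd
from databricks import sql

# Sketch: query the SQL Warehouse via databricks-sql-connector.
# All connection values below are placeholders.
with sql.connect(
    server_hostname="<my_instance>.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute('SELECT * FROM <my_database>.<my_table> LIMIT 10')
        rows = cursor.fetchall()
        columns = [col[0] for col in cursor.description]

df = pd.DataFrame(rows, columns=columns)
print(df)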
Is there a way to make this work, or does it depend on the cluster's settings, so that I simply cannot connect from my local machine?
Thank you in advance for any help!