I'm executing a Databricks job using the Notebook activity in Azure Data Factory. My data lands successfully following a previously executed Copy activity; however, when I try to access the data in Databricks I get the following error:
```
Py4JJavaError: An error occurred while calling o393.load.
: Failure to initialize configuration for storage account $DATALAKENAME.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.keyInvalid configuration value detected for fs.azure.account.key
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:52)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:715)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:2084)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:272)
	...
```
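For context, this is roughly the read that triggers the error. The storage account, container, path and format below are placeholders, not my real values:

```python
# Minimal sketch of the failing read. <container> and <datalakename> are
# placeholders for the real container and ADLS Gen2 account names.
path = "abfss://<container>@<datalakename>.dfs.core.windows.net/landing/"

df = (
    spark.read
         .format("parquet")  # assuming parquet here; the format is not the issue
         .load(path)         # .load() is the call that raises the Py4JJavaError
)
display(df)
```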
Setup steps:
- I have defined a Storage Credential in Unity Catalog using the Databricks Access Connector.
- I have defined an external location over the `abfss://` filepath I am attempting to read data from.
- I have granted my user account and the Data Factory MSI `READ FILES` and `WRITE FILES` permission on the external location (a sketch of these objects and grants follows this list).
- I have granted the Databricks Access Connector the `Storage Blob Data Contributor` role on the storage account.
- I have granted the Data Factory the `Contributor` role on the Databricks workspace.
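For reference, here is a hedged sketch of the Unity Catalog setup described above. All names (`adls_credential`, `landing_location`, the `abfss://` URL, and the principals) are placeholders rather than my actual values:

```python
# Sketch of the external location and grants; every name here is a placeholder.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS landing_location
    URL 'abfss://<container>@<datalakename>.dfs.core.windows.net/landing'
    WITH (STORAGE CREDENTIAL adls_credential)
""")

# Grant my user and the Data Factory managed identity (referenced in Unity
# Catalog by its application ID) read/write on the external location.
for principal in ("`me@mydomain.com`", "`<adf-managed-identity-application-id>`"):
    spark.sql(f"GRANT READ FILES ON EXTERNAL LOCATION landing_location TO {principal}")
    spark.sql(f"GRANT WRITE FILES ON EXTERNAL LOCATION landing_location TO {principal}")
```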
I am able to read the data when I run the notebook interactively under my user context, and I only receive this error when I execute the notebook from ADF, so I know I can successfully authenticate to storage using the access connector/external location. I do not want to enable access-key-based auth on the storage account as a workaround, as suggested in this similar post.
Root cause
I initially encountered the error because Unity Catalog was not configured on the workspace. After attaching the workspace to a Unity Catalog metastore and setting up the storage credential and external location, I attempted to validate the new configuration by using the Rerun task function on the Data Factory pipeline.
When rerunning the task this way, Data Factory reused the existing job cluster definition, which did not have Unity Catalog enabled.
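In hindsight, a quick check in the notebook would have shown that the rerun landed on a non-Unity-Catalog cluster. A hedged diagnostic sketch, assuming a recent Databricks Runtime where `current_metastore()` is available:

```python
# Confirm whether the cluster this run landed on is attached to a Unity
# Catalog metastore; on a non-UC cluster this call fails rather than
# returning a metastore id.
try:
    metastore = spark.sql("SELECT current_metastore() AS m").first()["m"]
    print(f"Unity Catalog metastore: {metastore}")
except Exception as err:
    print(f"Cluster does not appear to be Unity Catalog enabled: {err}")
```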
Solution
To solve the issue, I changed the Databricks Linked Service configuration and set the Unity Catalog Access Mode to `Assigned`.
I then triggered the pipeline as a new run (rather than a rerun), and the notebook execution was successful. My sense is that it was executing the pipeline as a new job after enabling Unity Catalog on the workspace, rather than the Unity Catalog Access Mode setting itself, that fixed the issue.
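My understanding (an assumption on my part, not something I verified against the submitted cluster spec) is that the linked service's Unity Catalog Access Mode maps to the `data_security_mode` field on the job cluster that ADF submits, roughly like this; the runtime version and node type below are placeholders:

```python
# Hypothetical new-cluster payload illustrating the mapping. Field names follow
# the Databricks clusters API, but the exact spec ADF submits is an assumption.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",  # placeholder runtime
    "node_type_id": "Standard_DS3_v2",    # placeholder node type
    "num_workers": 2,
    # "Assigned" in the ADF linked service is, as I understand it, the
    # single-user access mode on the cluster:
    "data_security_mode": "SINGLE_USER",
}
```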