I have Terraform code that deploys an Azure Databricks workspace:
resource "azurerm_databricks_workspace" "databricks" {
resource_group_name = var.resource_group_name
location = var.context.location # West Europe
name = local.dbw_name
managed_resource_group_name = local.mrg_name
sku = "premium" # Needed for private endpoint
public_network_access_enabled = false
network_security_group_rules_required = "NoAzureDatabricksRules" # https://docs.microsoft.com/en-us/azure/databricks/administration-guide/cloud-configurations/azure/private-link#--step-3-provision-an-azure-databricks-workspace-and-private-endpoints
custom_parameters {
no_public_ip = true # Security constrain
virtual_network_id = local.vnet_id
public_subnet_name = local.container_subnet_name
public_subnet_network_security_group_association_id = var.subnet_configuration_for_container.network_security_group_id
private_subnet_name = local.host_subnet_name
private_subnet_network_security_group_association_id = var.subnet_configuration_for_host.network_security_group_id
storage_account_name = local.st_name
}
tags = merge(local.default_tags, { managed_databricks = true })
lifecycle {
ignore_changes = [
tags["availability"],
tags["confidentiality"],
tags["integrity"],
tags["spoke_type"],
tags["traceability"],
]
precondition {
condition = length(local.dbw_name) < 64
error_message = "The Databricks resource name must be no longer than 64 characters. Please shorten the `instance` variable."
}
}
depends_on = [
data.azapi_resource.subnet["host"],
data.azapi_resource.subnet["container"]
]
}
We also have two private endpoints, one for the workspace web app (`databricks_ui_api`) and one for web auth (`browser_authentication`). Both are registered in our custom DNS so that the workspace URLs resolve to the private endpoint IPs and are reachable inside our network.
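For reference, a minimal resolution check of those records, run from a machine inside the network, looks like the sketch below. The hostnames are placeholders for our real workspace and web auth URLs:

```python
# Minimal DNS sanity check, assuming placeholder hostnames: from inside the
# network, both records should resolve to our private endpoint IPs (RFC 1918
# addresses), identically on all three subscriptions.
import socket

hosts = [
    "adb-1111111111111111.11.azuredatabricks.net",  # workspace web app (placeholder)
    "westeurope.pl-auth.azuredatabricks.net",       # web auth (placeholder)
]

for host in hosts:
    try:
        print(host, "->", socket.gethostbyname(host))  # expect a private IP
    except socket.gaierror as exc:
        print(host, "-> resolution failed:", exc)
```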
When deploying this code on subscription A, we have zero issues: clusters start correctly and quickly, with no timeouts.
But when deploying on subscriptions B and C, which are identical to A, we have issues. The only differences between A, B, and C are the subscription names/IDs (same tenant ID); the policies, the Terraform code, and the DNS/firewall/proxy are all the same.
On subscriptions B and C, clusters/jobs/dbt runs fail to start, but not every time.
Here is the error shown in the Databricks UI:
```
Failed to get instance bootstrap steps from the Databricks Control Plane.
Please check that instances have connectivity to the Databricks Control Plane.
Instance bootstrap failed command: GetRunbook Failure message: (Base64 encoded) XXXXXXXXXXXXXXX
VM extension code: ProvisioningState/succeeded instanceId:
InstanceId(aca79a0fb49e4c808700af118638e8ac)
workerEnv: workerenv-2477805373171457
Additional details (may be truncated): [Bootstrap Event] Command DownloadBootstrapScript finished. Storage Account: arprodwesteua4.blob.core.windows.net [SUCCEEDED]. Seconds Elapsed: 4.69204
2024/07/10 09:09:54 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetToken finished. [SUCCEEDED]. Seconds Elapsed: 0.0390388965607
2024/07/10 09:09:54 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0142250061035
2024/07/10 09:10:11 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0148358345032
2024/07/10 09:10:30 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0136380195618
2024/07/10 09:10:54 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.014701128006
2024/07/10 09:11:25 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0169570446014
2024/07/10 09:12:12 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0156190395355
2024/07/10 09:13:31 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0194149017334
2024/07/10 09:15:55 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.014839887619
2024/07/10 09:20:26 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0163550376892
2024/07/10 09:20:41 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetRunbook finished. [FAILED] . Seconds Elapsed: 646.803928137
2024/07/10 09:20:41 INFO vm_bootstrap.py:240: [Bootstrap Event] {FAILED_COMMAND:GetRunbook}
2024/07/10 09:20:41 INFO vm_bootstrap.py:242: [Bootstrap Event] {FAILED_MESSAGE:(Base64 encoded) XXXXXXXXXXX }
2024/07/10 09:20:41 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0184330940247
```
When starting a Spark session on my computer to reach the cluster (Databricks Connect), I get this:

```
24/07/11 08:32:25 WARN HTTPClient: Excluding proxy hosts for HTTP client based on env var no_proxy=localhost,10.0.0.0/8,storageA.blob.core.windows.net,storageB.blob.core.windows.net,storageC.blob.core.windows.net,storageD.blob.core.windows.net,storageE.blob.core.windows.net,storageF.blob.core.windows.net,storageG.blob.core.windows.net,.dev.azuresynapse.net,.azuresynapse.net,.table.core.windows.net,.queue.core.windows.net,.file.core.windows.net,.web.core.windows.net,.dfs.core.windows.net,.documents.azure.com,.batch.azure.com,.service.batch.azure.com,.vault.azure.net,.vaultcore.azure.net,.managedhsm.azure.net,.azmk8s.io,.search.windows.net,.azurecr.io,.azconfig.io,.servicebus.windows.net,.azure-devices.net,.servicebus.windows.net,.azure-devices-provisioning.net,.eventgrid.azure.net,.azurewebsites.net,.scm.azurewebsites.net,.api.azureml.ms,.notebooks.azure.net,.instances.azureml.ms,.aznbcontent.net,.inference.ml.azure.com,.cognitiveservices.azure.com,.afs.azure.net,.datafactory.azure.net,.adf.azure.com,.purview.azure.com,.azure-api.net,.developer.azure-api.net,.analysis.windows.net,.azuredatabricks.net,.dev.azure.com,.azurefd.net,.vsblob.vsassets.io,otr.dtc3.cf.saint-gobain.net,.openai.azure.com
24/07/11 08:32:26 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state TERMINATED, waiting for it to start running...
24/07/11 08:32:37 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state PENDING, waiting for it to start running...
24/07/11 08:46:22 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state PENDING, waiting for it to start running...
24/07/11 08:46:32 ERROR SparkClientManager: Fail to get the SparkClient
java.util.concurrent.ExecutionException: com.databricks.service.SparkServiceConnectionException: Invalid cluster ID: "xxxx-xxxxxx-xxxxxxxx"
The cluster ID you specified does not correspond to any existing cluster.
Cluster ID: The ID of the cluster on which you want to run your code
- This should look like 0123-456789-abcd012
- Get current value: spark.conf.get("spark.databricks.service.clusterId")
- Set via conf: spark.conf.set("spark.databricks.service.clusterId", <your cluster ID>)
- Set via environment variable: export DATABRICKS_CLUSTER_ID=<your cluster ID>
at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at org.sparkproject.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2193)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:3932)
at org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4721)
at com.databricks.service.SparkClientManager.liftedTree1$1(SparkClient.scala:377)
at com.databricks.service.SparkClientManager.getForSession(SparkClient.scala:376)
at com.databricks.service.SparkClientManager.getForSession$(SparkClient.scala:353)
at com.databricks.service.SparkClientManager$.getForSession(SparkClient.scala:401)
at com.databricks.service.SparkClientManager.getForCurrentSession(SparkClient.scala:351)
at com.databricks.service.SparkClientManager.getForCurrentSession$(SparkClient.scala:351)
at com.databricks.service.SparkClientManager$.getForCurrentSession(SparkClient.scala:401)
at com.databricks.service.SparkClient$.getServerHadoopConf(SparkClient.scala:297)
at com.databricks.spark.util.SparkClientContext$.getServerHadoopConf(SparkClientContext.scala:281)
at org.apache.spark.SparkContext.$anonfun$hadoopConfiguration$1(SparkContext.scala:407)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.SparkContext.hadoopConfiguration(SparkContext.scala:398)
at com.databricks.sql.DatabricksEdge.catalog(DatabricksEdge.scala:198)
at com.databricks.sql.DatabricksEdge.catalog$(DatabricksEdge.scala:197)
at org.apache.spark.sql.internal.SessionStateBuilder.catalog$lzycompute(SessionState.scala:179)
at org.apache.spark.sql.internal.SessionStateBuilder.catalog(SessionState.scala:179)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog$lzycompute(BaseSessionStateBuilder.scala:190)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog(BaseSessionStateBuilder.scala:190)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager$lzycompute(BaseSessionStateBuilder.scala:193)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager(BaseSessionStateBuilder.scala:192)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.<init>(BaseSessionStateBuilder.scala:208)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:208)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$7(BaseSessionStateBuilder.scala:427)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:106)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:106)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:171)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:24)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:352)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$4(QueryExecution.scala:393)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:821)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:393)
at com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:389)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1073)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:389)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:165)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:165)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:155)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:100)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1073)
at org.apache.spark.sql.SparkSession.$anonfun$withActiveAndFrameProfiler$1(SparkSession.scala:1080)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:24)
at org.apache.spark.sql.SparkSession.withActiveAndFrameProfiler(SparkSession.scala:1080)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:811)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:835)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
at java.base/java.lang.reflect.Method.invoke(Method.java:578)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: com.databricks.service.SparkServiceConnectionException: Invalid cluster ID: "xxxx-xxxxxx-xxxxxxxx"
```
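To separate a cluster-side bootstrap failure from a client-side Databricks Connect problem, we can poll the cluster state directly through the Clusters API while the cluster starts. A minimal sketch, with placeholder host, token, and cluster ID:

```python
# Hedged sketch: watch the cluster state via the Clusters REST API to see
# whether the cluster itself ever reaches RUNNING, or dies during bootstrap.
# HOST/TOKEN/CLUSTER_ID below are placeholders for our real values.
import time
import requests

HOST = "https://adb-xxxxxxxxxxxx.xx.azuredatabricks.net"  # shard address
TOKEN = "dapi..."                                         # PAT (placeholder)
CLUSTER_ID = "xxxx-xxxxxx-xxxxxxxx"

while True:
    resp = requests.get(
        f"{HOST}/api/2.0/clusters/get",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"cluster_id": CLUSTER_ID},
        timeout=30,
    )
    resp.raise_for_status()
    info = resp.json()
    print(info["state"], "-", info.get("state_message", ""))
    if info["state"] in ("RUNNING", "TERMINATED", "ERROR"):
        break
    time.sleep(30)
```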
Sometimes we get errors like this:

```
24/07/11 08:13:28 WARN SparkServiceRPCClient: Fatal connection error for RPC 1257ff66-7657-46de-8c8a-28bad097c6b9
Traceback (most recent call last):
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/main.py", line 67, in <module>
main()
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/main.py", line 31, in main
metric = count_check.run(
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/core/count_check.py", line 53, in run
current_df = reduce(
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/core/count_check.py", line 56, in <genexpr>
spark.table(table_info.table_name.format(settings.deploy_env))
File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/instrumentation_utils.py", line 48, in wrapper
res = func(*args, **kwargs)
File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/sql/session.py", line 1423, in table
return DataFrame(self._jsparkSession.table(tableName), self)
File "/azureml-envs/python3.10/lib/python3.10/site-packages/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/errors/exceptions.py", line 228, in deco
return f(*a, **kw)
File "/azureml-envs/python3.10/lib/python3.10/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o26.table.
: com.databricks.service.SparkServiceConnectionException: Request failed with HTTP 404
Client information:
Shard address: "https://adb-xxxxxxxxxxxx.xx.azuredatabricks.net"
Cluster ID: "xxxx-xxxxxx-xxxxxxxx"
Port: 15001
Token ID: "xxxxxxxxxxxxxxx"
Org ID: xxxxxxxxxxxx
Response:
Tunnel f155a135180b448bba783398a66a1878.workerenv-2477805373171457.mux.ngrok-dataplane.wildcard not found
at com.databricks.service.SparkServiceRPCClient.handleResponse(SparkServiceRPCClient.scala:134)
at com.databricks.service.SparkServiceRPCClient.doPost(SparkServiceRPCClient.scala:112)
```
We have no clue why the errors are so random. We ran telnet checks against our firewall/proxy/DNS and everything responds; all ports for workspaces A/B/C are opened the same way, and all three workspaces use no_public_ip.
I am not sure why the Control Plane is not reached. And even when it is, we sometimes get timeouts when reading/writing on DBFS and ABFS (the mount points are correctly configured).
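When one of the clusters does come up, we can rerun the same reachability test from a notebook on it, since the worker subnets may take a different network path than our workstations. A minimal sketch; apart from the storage account taken from the bootstrap log above, the endpoints are placeholders to verify against the published Databricks control plane addresses for West Europe:

```python
# Hedged sketch: TCP reachability from the data plane to the endpoints the
# bootstrap needs. The SCC relay hostname is our assumption for West Europe;
# the workspace URL is a placeholder, the storage account comes from the log.
import socket

endpoints = [
    ("tunnel.westeurope.azuredatabricks.net", 443),    # SCC relay (assumed)
    ("arprodwesteua4.blob.core.windows.net", 443),     # artifact storage (from log)
    ("adb-xxxxxxxxxxxx.xx.azuredatabricks.net", 443),  # workspace web app (placeholder)
]

for host, port in endpoints:
    try:
        with socket.create_connection((host, port), timeout=10):
            print(f"{host}:{port} reachable")
    except OSError as exc:
        print(f"{host}:{port} FAILED: {exc}")
```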