I have Terraform code that deploys an Azure Databricks workspace:
resource "azurerm_databricks_workspace" "databricks" {
resource_group_name = var.resource_group_name
location = var.context.location # West Europe
name = local.dbw_name
managed_resource_group_name = local.mrg_name
sku = "premium" # Needed for private endpoint
public_network_access_enabled = false
network_security_group_rules_required = "NoAzureDatabricksRules" # https://docs.microsoft.com/en-us/azure/databricks/administration-guide/cloud-configurations/azure/private-link#--step-3-provision-an-azure-databricks-workspace-and-private-endpoints
custom_parameters {
no_public_ip = true # Security constrain
virtual_network_id = local.vnet_id
public_subnet_name = local.container_subnet_name
public_subnet_network_security_group_association_id = var.subnet_configuration_for_container.network_security_group_id
private_subnet_name = local.host_subnet_name
private_subnet_network_security_group_association_id = var.subnet_configuration_for_host.network_security_group_id
storage_account_name = local.st_name
}
tags = merge(local.default_tags, { managed_databricks = true })
lifecycle {
ignore_changes = [
tags["availability"],
tags["confidentiality"],
tags["integrity"],
tags["spoke_type"],
tags["traceability"],
]
precondition {
condition = length(local.dbw_name) < 64
error_message = "The Databricks resource name must be no longer than 64 characters. Please shorten the `instance` variable."
}
}
depends_on = [
data.azapi_resource.subnet["host"],
data.azapi_resource.subnet["container"]
]
}
We also have two private endpoints, one for the workspace web app (`databricks_ui_api`) and one for web auth (`browser_authentication`). Both are registered in our custom DNS so that the workspace URLs resolve to the private endpoint IPs and are reachable inside our network.
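For reference, a minimal resolution check of those records, run from a machine inside the network, looks like the sketch below. The hostnames are placeholders for our real workspace and web auth URLs:

```python
# Minimal DNS sanity check, assuming placeholder hostnames: from inside the
# network, both records should resolve to our private endpoint IPs (RFC 1918
# addresses), identically on all three subscriptions.
import socket

hosts = [
    "adb-1111111111111111.11.azuredatabricks.net",  # workspace web app (placeholder)
    "westeurope.pl-auth.azuredatabricks.net",       # web auth (placeholder)
]

for host in hosts:
    try:
        print(host, "->", socket.gethostbyname(host))  # expect a private IP
    except socket.gaierror as exc:
        print(host, "-> resolution failed:", exc)
```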
When deploying this code on subscription A, we have zero issues: clusters start correctly and quickly, with no timeouts.
But when deploying on subscriptions B and C, which are identical to A, we have issues. The only differences between A, B, and C are the subscription names/IDs (same tenant ID); the policies, the Terraform code, and the DNS/firewall/proxy are all the same.
On subscriptions B and C, clusters/jobs/dbt runs fail to start, but not every time.
Here is the error shown in the Databricks UI:
```
Failed to get instance bootstrap steps from the Databricks Control Plane.
Please check that instances have connectivity to the Databricks Control Plane.
Instance bootstrap failed command: GetRunbook Failure message: (Base64 encoded) XXXXXXXXXXXXXXX
VM extension code: ProvisioningState/succeeded instanceId:
InstanceId(aca79a0fb49e4c808700af118638e8ac)
workerEnv: workerenv-2477805373171457
Additional details (may be truncated): [Bootstrap Event] Command DownloadBootstrapScript finished. Storage Account: arprodwesteua4.blob.core.windows.net [SUCCEEDED]. Seconds Elapsed: 4.69204
2024/07/10 09:09:54 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetToken finished. [SUCCEEDED]. Seconds Elapsed: 0.0390388965607
2024/07/10 09:09:54 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0142250061035
2024/07/10 09:10:11 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0148358345032
2024/07/10 09:10:30 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0136380195618
2024/07/10 09:10:54 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.014701128006
2024/07/10 09:11:25 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0169570446014
2024/07/10 09:12:12 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0156190395355
2024/07/10 09:13:31 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0194149017334
2024/07/10 09:15:55 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.014839887619
2024/07/10 09:20:26 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0163550376892
2024/07/10 09:20:41 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetRunbook finished. [FAILED] . Seconds Elapsed: 646.803928137
2024/07/10 09:20:41 INFO vm_bootstrap.py:240: [Bootstrap Event] {FAILED_COMMAND:GetRunbook}
2024/07/10 09:20:41 INFO vm_bootstrap.py:242: [Bootstrap Event] {FAILED_MESSAGE:(Base64 encoded) XXXXXXXXXXX }
2024/07/10 09:20:41 INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0184330940247
```
When starting a Spark session on my computer to reach the cluster (Databricks Connect), I get this:

```
24/07/11 08:32:25 WARN HTTPClient: Excluding proxy hosts for HTTP client based on env var no_proxy=localhost,10.0.0.0/8,storageA.blob.core.windows.net,storageB.blob.core.windows.net,storageC.blob.core.windows.net,storageD.blob.core.windows.net,storageE.blob.core.windows.net,storageF.blob.core.windows.net,storageG.blob.core.windows.net,.dev.azuresynapse.net,.azuresynapse.net,.table.core.windows.net,.queue.core.windows.net,.file.core.windows.net,.web.core.windows.net,.dfs.core.windows.net,.documents.azure.com,.batch.azure.com,.service.batch.azure.com,.vault.azure.net,.vaultcore.azure.net,.managedhsm.azure.net,.azmk8s.io,.search.windows.net,.azurecr.io,.azconfig.io,.servicebus.windows.net,.azure-devices.net,.servicebus.windows.net,.azure-devices-provisioning.net,.eventgrid.azure.net,.azurewebsites.net,.scm.azurewebsites.net,.api.azureml.ms,.notebooks.azure.net,.instances.azureml.ms,.aznbcontent.net,.inference.ml.azure.com,.cognitiveservices.azure.com,.afs.azure.net,.datafactory.azure.net,.adf.azure.com,.purview.azure.com,.azure-api.net,.developer.azure-api.net,.analysis.windows.net,.azuredatabricks.net,.dev.azure.com,.azurefd.net,.vsblob.vsassets.io,otr.dtc3.cf.saint-gobain.net,.openai.azure.com
24/07/11 08:32:26 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state TERMINATED, waiting for it to start running...
24/07/11 08:32:37 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state PENDING, waiting for it to start running...
24/07/11 08:46:22 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state PENDING, waiting for it to start running...
24/07/11 08:46:32 ERROR SparkClientManager: Fail to get the SparkClient
java.util.concurrent.ExecutionException: com.databricks.service.SparkServiceConnectionException: Invalid cluster ID: "xxxx-xxxxxx-xxxxxxxx"
The cluster ID you specified does not correspond to any existing cluster.
Cluster ID: The ID of the cluster on which you want to run your code
- This should look like 0123-456789-abcd012
- Get current value: spark.conf.get("spark.databricks.service.clusterId")
- Set via conf: spark.conf.set("spark.databricks.service.clusterId", <your cluster ID>)
- Set via environment variable: export DATABRICKS_CLUSTER_ID=<your cluster ID>
at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at org.sparkproject.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2193)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:3932)
at org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4721)
at com.databricks.service.SparkClientManager.liftedTree1$1(SparkClient.scala:377)
at com.databricks.service.SparkClientManager.getForSession(SparkClient.scala:376)
at com.databricks.service.SparkClientManager.getForSession$(SparkClient.scala:353)
at com.databricks.service.SparkClientManager$.getForSession(SparkClient.scala:401)
at com.databricks.service.SparkClientManager.getForCurrentSession(SparkClient.scala:351)
at com.databricks.service.SparkClientManager.getForCurrentSession$(SparkClient.scala:351)
at com.databricks.service.SparkClientManager$.getForCurrentSession(SparkClient.scala:401)
at com.databricks.service.SparkClient$.getServerHadoopConf(SparkClient.scala:297)
at com.databricks.spark.util.SparkClientContext$.getServerHadoopConf(SparkClientContext.scala:281)
at org.apache.spark.SparkContext.$anonfun$hadoopConfiguration$1(SparkContext.scala:407)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.SparkContext.hadoopConfiguration(SparkContext.scala:398)
at com.databricks.sql.DatabricksEdge.catalog(DatabricksEdge.scala:198)
at com.databricks.sql.DatabricksEdge.catalog$(DatabricksEdge.scala:197)
at org.apache.spark.sql.internal.SessionStateBuilder.catalog$lzycompute(SessionState.scala:179)
at org.apache.spark.sql.internal.SessionStateBuilder.catalog(SessionState.scala:179)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog$lzycompute(BaseSessionStateBuilder.scala:190)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog(BaseSessionStateBuilder.scala:190)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager$lzycompute(BaseSessionStateBuilder.scala:193)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager(BaseSessionStateBuilder.scala:192)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.<init>(BaseSessionStateBuilder.scala:208)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:208)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$7(BaseSessionStateBuilder.scala:427)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:106)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:106)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:171)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:24)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:352)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$4(QueryExecution.scala:393)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:821)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:393)
at com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:389)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1073)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:389)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:165)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:165)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:155)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:100)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1073)
at org.apache.spark.sql.SparkSession.$anonfun$withActiveAndFrameProfiler$1(SparkSession.scala:1080)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:24)
at org.apache.spark.sql.SparkSession.withActiveAndFrameProfiler(SparkSession.scala:1080)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:811)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:835)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
at java.base/java.lang.reflect.Method.invoke(Method.java:578)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: com.databricks.service.SparkServiceConnectionException: Invalid cluster ID: "xxxx-xxxxxx-xxxxxxxx"
```
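To separate a cluster-side bootstrap failure from a client-side Databricks Connect problem, we can poll the cluster state directly through the Clusters API while the cluster starts. A minimal sketch, with placeholder host, token, and cluster ID:

```python
# Hedged sketch: watch the cluster state via the Clusters REST API to see
# whether the cluster itself ever reaches RUNNING, or dies during bootstrap.
# HOST/TOKEN/CLUSTER_ID below are placeholders for our real values.
import time
import requests

HOST = "https://adb-xxxxxxxxxxxx.xx.azuredatabricks.net"  # shard address
TOKEN = "dapi..."                                         # PAT (placeholder)
CLUSTER_ID = "xxxx-xxxxxx-xxxxxxxx"

while True:
    resp = requests.get(
        f"{HOST}/api/2.0/clusters/get",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"cluster_id": CLUSTER_ID},
        timeout=30,
    )
    resp.raise_for_status()
    info = resp.json()
    print(info["state"], "-", info.get("state_message", ""))
    if info["state"] in ("RUNNING", "TERMINATED", "ERROR"):
        break
    time.sleep(30)
```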
Sometimes we get errors like this:

```
24/07/11 08:13:28 WARN SparkServiceRPCClient: Fatal connection error for RPC 1257ff66-7657-46de-8c8a-28bad097c6b9
Traceback (most recent call last):
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/main.py", line 67, in <module>
main()
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/main.py", line 31, in main
metric = count_check.run(
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/core/count_check.py", line 53, in run
current_df = reduce(
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/core/count_check.py", line 56, in <genexpr>
spark.table(table_info.table_name.format(settings.deploy_env))
File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/instrumentation_utils.py", line 48, in wrapper
res = func(*args, **kwargs)
File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/sql/session.py", line 1423, in table
return DataFrame(self._jsparkSession.table(tableName), self)
File "/azureml-envs/python3.10/lib/python3.10/site-packages/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/errors/exceptions.py", line 228, in deco
return f(*a, **kw)
File "/azureml-envs/python3.10/lib/python3.10/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o26.table.
: com.databricks.service.SparkServiceConnectionException: Request failed with HTTP 404
Client information:
Shard address: "https://adb-xxxxxxxxxxxx.xx.azuredatabricks.net"
Cluster ID: "xxxx-xxxxxx-xxxxxxxx"
Port: 15001
Token ID: "xxxxxxxxxxxxxxx"
Org ID: xxxxxxxxxxxx
Response:
Tunnel f155a135180b448bba783398a66a1878.workerenv-2477805373171457.mux.ngrok-dataplane.wildcard not found
at com.databricks.service.SparkServiceRPCClient.handleResponse(SparkServiceRPCClient.scala:134)
at com.databricks.service.SparkServiceRPCClient.doPost(SparkServiceRPCClient.scala:112)
```
We have no clue why the errors are so random. We ran telnet checks against our firewall/proxy/DNS and everything responds; all ports for workspaces A/B/C are opened the same way, and all three workspaces use no_public_ip.
I am not sure why the Control Plane is not reached. And even when it is, we sometimes get timeouts when reading/writing on DBFS and ABFS (the mount points are correctly configured).
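When one of the clusters does come up, we can rerun the same reachability test from a notebook on it, since the worker subnets may take a different network path than our workstations. A minimal sketch; apart from the storage account taken from the bootstrap log above, the endpoints are placeholders to verify against the published Databricks control plane addresses for West Europe:

```python
# Hedged sketch: TCP reachability from the data plane to the endpoints the
# bootstrap needs. The SCC relay hostname is our assumption for West Europe;
# the workspace URL is a placeholder, the storage account comes from the log.
import socket

endpoints = [
    ("tunnel.westeurope.azuredatabricks.net", 443),    # SCC relay (assumed)
    ("arprodwesteua4.blob.core.windows.net", 443),     # artifact storage (from log)
    ("adb-xxxxxxxxxxxx.xx.azuredatabricks.net", 443),  # workspace web app (placeholder)
]

for host, port in endpoints:
    try:
        with socket.create_connection((host, port), timeout=10):
            print(f"{host}:{port} reachable")
    except OSError as exc:
        print(f"{host}:{port} FAILED: {exc}")
```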