I am working on a DR plan for Databricks in Azure.
The documentation says ‘Azure Databricks account admins can create one metastore for each region’.
But it doesn’t mention a backup or regional-failure scenario for the metastore.
What if the metastore/Databricks fails in one region while the data/tables in external storage remain available for reading, or after a storage failover?
Users should still be able to read the storage (from another region).
But without a working metastore, how can this be achieved?
I agree with @Ganesh Chandrasekaran
In an Azure Databricks Disaster Recovery (DR) plan, the metastore needs particular attention: while compute resources are ephemeral and can be terminated when not in use, the metastore is what preserves the structure and metadata of your data warehouse tables.
If a regional failure makes the primary metastore unavailable, you lose the ability to query your tables by name, even though the data itself remains securely stored in geo-redundant storage and accessible from other regions.
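One mitigating detail: data written in an open format stays directly readable by storage path, with no metastore lookup involved, from a cluster in the secondary region. A minimal sketch (the storage account, container, and path below are placeholders, and the cluster is assumed to already have credentials for the failed-over storage):

# Read a Delta table directly by its storage path; no metastore is involved.
df = spark.read.format("delta").load(
    "abfss://data@<storage-account>.dfs.core.windows.net/tables/orders"
)
df.show(5)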
For a more complete solution, you can set up an external Hive metastore.
Every Azure Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata.
You also have the option to use an existing external Hive metastore.
The steps and prerequisites are:
-> Two Databricks workspaces need to be created.
-> An Azure SQL server and database need to be set up to store the Hive metastore.
-> A storage account (preferably ADLS Gen2) needs to be created to store table data.
-> A Service Principal needs to be created, and the following details noted:
- Application ID
- Application Secret
- Tenant ID
-> To create a Service Principal in the Azure Portal, follow the steps provided here.
-> Grant the Service Principal “Storage Blob Data Contributor” access on the storage account created above (a sketch of the corresponding storage access configuration follows this list).
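With that role in place, a cluster or notebook session can authenticate to ADLS Gen2 with the Service Principal. A minimal sketch, assuming the application secret is stored in a Databricks secret scope; the scope, key, storage account, application ID, and tenant ID values are placeholders:

# Authenticate to ADLS Gen2 with the Service Principal (OAuth 2.0 client credentials).
service_credential = dbutils.secrets.get(scope="<scope>", key="<secret-key>")

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net",
               "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net",
               service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")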
For the external Hive metastore itself, provide the following in the Spark configuration:
spark.hadoop.javax.jdo.option.ConnectionUserName <sql user name>
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<azure sql server name>.database.windows.net:1433;database=<azure sql db name>;encrypt=true;trustServerCertificate=false;loginTimeout=30;
spark.hadoop.javax.jdo.option.ConnectionPassword <azure sql password>
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 2.3.7
To add the above Spark configuration in Azure Databricks, click Edit on the cluster page and enter it under Advanced options > Spark.
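Once both workspaces point at the same external metastore, a quick smoke test confirms that table metadata survives the loss of a single workspace. A minimal sketch, assuming Delta tables; the database, table, and path names are placeholders:

# In the primary workspace: register an external table whose data lives in ADLS Gen2.
spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders
    USING DELTA
    LOCATION 'abfss://data@<storage-account>.dfs.core.windows.net/tables/orders'
""")

# In the secondary workspace (configured with the same external metastore), the table
# is visible by name, so after a regional failure you can keep querying it without
# recreating any metadata.
spark.sql("SHOW TABLES IN sales_db").show()
spark.sql("SELECT COUNT(*) AS order_count FROM sales_db.orders").show()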
Reference:
Sharing Metadata Across Different Databricks Workspaces Using Hive External Metastore