I am using Auto Loader to ingest CSV files from a managed volume.
When I run the DLT pipeline, I get the error below:
# Define variables used in code below
from pyspark.sql.functions import col, current_timestamp

file_path = "/Volumes/CAD/default/raw_files/"
username = spark.sql("SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')").first()[0]
table_name_bronze = "raw_prod_data1"
table_name_silver = "processed_prod_data1"
checkpoint_path = f"/tmp/{username}/_checkpoint/incr_batch"
badrecords_path = "/Volumes/CAD/default/managedvolume/_badrecords/"

(spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("header", "true")
.option("multiLine", "true")
.option("escape", '"')
.option("badRecordsPath", badrecords_path)
.option("cloudFiles.schemaEvolutionMode", "rescue")
.option("cloudFiles.schemaLocation", checkpoint_path)
.option("pathGlobFilter", "*.csv")  # read only CSV files
.load(file_path)
.select("*", col("_metadata.file_path").alias("source_file"), current_timestamp().alias("processing_time"))
.writeStream
.option("checkpointLocation", checkpoint_path)
AnalysisException: [RequestId=d898c615-9e0b-4f26-9e89-ab79636c701a ErrorClass=INVALID_PARAMETER_VALUE.LOCATION_OVERLAP]
Input path url 'abfss://unity-catalog-storage@dbstorages53opsvfqrzvi.dfs.core.windows.net/1023541510876078/__unitystorage/schemas/02470915-7c4c-41d5-881e-e2a850a26d80/tables/414d4694-bde7-45db-a9ae-5a731b10b2a6/_dlt_metadata/_autoloader' overlaps with managed storage within 'GenerateTemporaryPathCredential' call. .,None,Map(),Map(),List(),List(),Map())
file_path and badrecords_path are managed locations.
Based on the link (https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/paths?source=recommendations) I am aware of the path restrictions in Unity Catalog, and as far as I can tell there is no overlap in the checkpoint or schema-inference paths. I might be wrong, though.
Could someone please tell me what exactly this error means and how I would go about troubleshooting it?
Could someone also let me know where the data files for the tables will be located?
Thanks.
Please make sure you check all the conditions mentioned here.
Unity Catalog manages specific storage locations, and the DLT paths used for schema evolution, checkpoints, and bad records must not overlap with any Unity Catalog storage or managed path.
In particular, note the following restrictions:
- External locations cannot overlap other external locations.
- Tables and volumes store data files in external locations or the metastore root location.
- Tables and volumes cannot overlap each other.
- Managed storage locations cannot overlap each other.
- External volumes cannot overlap managed storage locations.
- External tables cannot overlap managed storage locations.
- You cannot define an external location within another external location.
- You cannot define a table within another table.
- You cannot define a table on any data files or directories within a volume.
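In every rule above, "overlap" means one path is equal to, or nested inside, the other. As a quick sanity check you can compare your pipeline paths pairwise; this is only a sketch, and the sample paths are hypothetical placeholders for your own:

```python
from pathlib import PurePosixPath

def paths_overlap(a: str, b: str) -> bool:
    """Return True if one path equals, or is nested inside, the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents

# Hypothetical pipeline paths -- substitute the ones from your notebook
paths = {
    "file_path": "/Volumes/CAD/default/raw_files/",
    "badrecords_path": "/Volumes/CAD/default/managedvolume/_badrecords/",
    "checkpoint_path": "/tmp/some_user/_checkpoint/incr_batch",
}
for name_a, a in paths.items():
    for name_b, b in paths.items():
        if name_a < name_b and paths_overlap(a, b):
            print(f"{name_a} overlaps with {name_b}")
```

Note this only catches overlaps among the paths you pass in; it cannot see the managed storage locations Unity Catalog assigns internally (like the `__unitystorage/.../_dlt_metadata/_autoloader` path in your error).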
You were saying file_path and badrecords_path are managed locations, so check whether anything overlaps there.
Also check that the location your writeStream writes to does not overlap with any other managed location, table, or volume.
Basically, if any external location you are using sits at the root of (or inside) a managed storage location, you will get this error; check this to learn more about it.
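One way to avoid the overlap is a sketch like the following — not a verified fix, and it assumes you have created a separate volume (here the hypothetical /Volumes/CAD/default/pipeline_state) that does not nest inside any table or other volume. The source stays where it is; only the Auto Loader bookkeeping paths move to the dedicated volume:

```python
from pyspark.sql.functions import col, current_timestamp

source_path = "/Volumes/CAD/default/raw_files/"      # input CSVs (unchanged)
state_root = "/Volumes/CAD/default/pipeline_state"   # hypothetical dedicated volume
checkpoint_path = f"{state_root}/_checkpoint/incr_batch"
badrecords_path = f"{state_root}/_badrecords/"

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("badRecordsPath", badrecords_path)            # no longer under the ingest volume
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(source_path)
    .select("*",
            col("_metadata.file_path").alias("source_file"),
            current_timestamp().alias("processing_time"))
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .toTable("CAD.default.raw_prod_data1"))
```

This way none of source_path, badrecords_path, or checkpoint_path is nested inside another managed location. Also note that if this code runs inside a DLT pipeline, DLT manages checkpoints for you, so an explicit writeStream with checkpointLocation is usually unnecessary there.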