I am trying to validate an XML file against an XSD while reading it with Spark on Databricks. Here is what I did:
- Created an XML file
xmlPath = "dbfs:/mnt/books.xml"
xmlString = """
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
</book>"""
dbutils.fs.put(xmlPath, xmlString, True)
- Created an XSD file
xsd_Path = "dbfs:/mnt/books.xsd"
xsd_String = """<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="book">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="author" type="xsd:string" />
<xsd:element name="title" type="xsd:string" />
</xsd:sequence>
<xsd:attribute name="id" type="xsd:string" use="required" />
</xsd:complexType>
</xsd:element>
</xsd:schema>"""
dbutils.fs.put(xsd_Path, xsd_String, True)
- Read the XML file with the rowValidationXSDPath option
df = (spark.read
.format("xml")
.option("rowTag", "book")
.option("rowValidationXSDPath", xsd_Path)
.load(xmlPath))
df.printSchema()
- Got this error message
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 4 times, most recent failure: Lost task 0.3 in stage 30.0 (TID 123) (10.139.64.10 executor driver): java.util.concurrent.ExecutionException: org.xml.sax.SAXParseException; schema_reference.4: Failed to read schema document ‘file:/local_disk0/spark-*************/dbfs:/mnt/books.xsd’, because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>
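The path in the error, with dbfs:/mnt/books.xsd appended to a local file:/local_disk0/... directory, suggests that the rowValidationXSDPath value is being treated as a local file name rather than as a DBFS URI. Below is a minimal sketch, assuming the cluster exposes DBFS through the /dbfs local mount, of deriving a local-style path to try instead. The helper to_local_dbfs_path is hypothetical (it is not part of Spark or dbutils), and I have not confirmed that this resolves the error.

```python
def to_local_dbfs_path(dbfs_uri: str) -> str:
    """Map a 'dbfs:/...' URI to its '/dbfs/...' local mount path.

    Hypothetical helper for illustration; assumes the /dbfs mount
    is available on the cluster nodes.
    """
    prefix = "dbfs:"
    if not dbfs_uri.startswith(prefix):
        raise ValueError(f"not a dbfs: URI: {dbfs_uri!r}")
    # Drop the scheme and any leading slashes, then re-root under /dbfs.
    rest = dbfs_uri[len(prefix):].lstrip("/")
    return "/dbfs/" + rest


xsd_Path = "dbfs:/mnt/books.xsd"
local_xsd_path = to_local_dbfs_path(xsd_Path)
print(local_xsd_path)  # /dbfs/mnt/books.xsd
```

The resulting local_xsd_path could then be passed to .option("rowValidationXSDPath", local_xsd_path) in place of xsd_Path. As far as I can tell, the spark-xml documentation also describes shipping the XSD to the executors with SparkContext.addFile and passing only the file name as rowValidationXSDPath, which may be the other thing to try.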
How can I fix this error?