I am new to Apache Spark on Databricks. I’m trying to read an XML file and process it.
My XML looks like this:
<code><?xml version="1.0" encoding="utf-16"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
<header creationtool="value" creationtoolversion="value" segtype="value" adminlang="en-us" creationid="value" srclang="en" o-tmf="value" datatype="unknown">
<prop type="defclient"> </prop>
<prop type="defproject"> </prop>
<prop type="defdomain"> </prop>
<prop type="defsubject"> </prop>
<prop type="description"> </prop>
<prop type="targetlang">fr</prop>
<prop type="name">name</prop>
</header>
<body>
<tu changedate="20170710T192020Z" creationdate="20170710T145400Z" creationid="name" changeid="name">
<prop type="client"> </prop>
<prop type="project"> </prop>
<prop type="domain"> </prop>
<prop type="subject"> </prop>
<prop type="corrected">no</prop>
<prop type="aligned">no</prop>
<prop type="x-document">document</prop>
<tuv xml:lang="en">
<prop type="value"></prop>
<seg>value</seg>
</tuv>
<tuv xml:lang="fr">
<seg>value</seg>
</tuv>
</tu>
</body>
</tmx>
</code>
I can’t distinguish the prop tags inside the header tag from those inside the body tag.
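To make the goal concrete, here is a minimal sketch in plain Java (JDK DOM and XPath only, no Spark) of the separation I am after: props are told apart by their parent path, `/tmx/header/prop` versus `/tmx/body/tu/prop`. The class name `PropSplitter`, the method `propTypes`, and the shortened sample string are illustrative assumptions, and the sample omits the DOCTYPE line so the parser does not go looking for `tmx14.dtd`. It only illustrates the distinction; it is not a spark-xml solution.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PropSplitter {

    /** Returns the "type" attribute of every element matched by xpathExpr. */
    public static List<String> propTypes(String xml, String xpathExpr) throws Exception {
        // Parse the XML string into a DOM tree.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Select only the elements on the requested path.
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpathExpr, doc, XPathConstants.NODESET);

        List<String> types = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            types.add(((Element) nodes.item(i)).getAttribute("type"));
        }
        return types;
    }

    public static void main(String[] args) throws Exception {
        // Shortened, hypothetical sample with the same nesting as the file above.
        String xml = "<tmx version=\"1.4\">"
                + "<header><prop type=\"targetlang\">fr</prop><prop type=\"name\">name</prop></header>"
                + "<body><tu><prop type=\"client\"> </prop></tu></body>"
                + "</tmx>";

        System.out.println("header props: " + propTypes(xml, "/tmx/header/prop"));
        System.out.println("body props:   " + propTypes(xml, "/tmx/body/tu/prop"));
    }
}
```

The point of the sketch is that the parent path, not the tag name, identifies which group a prop belongs to.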
I defined the following schema and tried to separate the header tag from the body tag by writing two output files:
StructType propSchema = new StructType()
.add("_VALUE", DataTypes.StringType)
.add("_type", DataTypes.StringType);
<code>StructType header = new StructType()
.add("_creationtool", DataTypes.StringType)
.add("_creationtoolversion", DataTypes.StringType)
.add("_segtype", DataTypes.StringType)
.add("_adminlang", DataTypes.StringType)
.add("_creationid", DataTypes.StringType)
.add("_srclang", DataTypes.StringType)
.add("_o-tmf", DataTypes.StringType)
.add("_datatype", DataTypes.StringType)
.add("prop",DataTypes.createArrayType(propSchema));
StructType segSchema = new StructType()
.add("_VALUE", DataTypes.StringType)
.add("bpt", DataTypes.createArrayType(new StructType()
.add("_i", DataTypes.StringType)
.add("_type", DataTypes.StringType)
.add("_VALUE", DataTypes.StringType)))
.add("ept", DataTypes.createArrayType(new StructType()
.add("_i", DataTypes.StringType)
.add("_VALUE", DataTypes.StringType)));
StructType tuvSchema = new StructType()
.add("_xml:lang", DataTypes.StringType)
.add("_VALUE", DataTypes.StringType, true)
.add("prop", DataTypes.createArrayType(propSchema))
.add("seg", DataTypes.createArrayType(segSchema));
StructType tuSchema = new StructType()
.add("_changedate", DataTypes.StringType)
.add("_creationdate", DataTypes.StringType)
.add("_creationid", DataTypes.StringType)
.add("_changeid", DataTypes.StringType)
.add("prop", DataTypes.createArrayType(propSchema))
.add("tuv", DataTypes.createArrayType(tuvSchema));
StructType schema = new StructType()
.add("header", new StructType()
.add("prop", DataTypes.createArrayType(propSchema))
.add("_creationtool", DataTypes.StringType)
.add("_creationtoolversion", DataTypes.StringType)
.add("_segtype", DataTypes.StringType)
.add("_adminlang", DataTypes.StringType)
.add("_creationid", DataTypes.StringType)
.add("_srclang", DataTypes.StringType)
.add("_o-tmf", DataTypes.StringType)
.add("_datatype", DataTypes.StringType))
.add("body", new StructType()
.add("tu", DataTypes.createArrayType(tuSchema)));
Dataset<Row> xmlData = spark.read()
.format("com.databricks.spark.xml")
.option("rowTag", "tu")
.option("rootTag", "tmx")
.option("declaration", "foo")
.load(inputPath);
Dataset<Row> xmlHeaderData = spark.read()
.format("com.databricks.spark.xml")
.option("rowTag", "prop")
.option("rootTag", "tmx")
.option("declaration", "foo")
.schema(schema)
.load(inputPath);
xmlHeaderData.printSchema();
xmlHeaderData = xmlHeaderData.selectExpr(
"header.prop as prop");
xmlData.show(false);
xmlData.printSchema();
xmlData.write()
.mode(SaveMode.Overwrite)
.format("com.databricks.spark.xml")
.option("rootTag", "tmx")
.option("rowTag", "tu")
.save(outputPath);
xmlHeaderData.write()
.mode(SaveMode.Overwrite)
.format("com.databricks.spark.xml")
.option("rootTag", "header")
.option("rowTag", "prop")
.save(outputPath+"1");
</code>
My first output file looks like this:
<code><tmx>
<tu changedate="20170710T192020Z" changeid="Myriam.Legault" creationdate="20170710T145400Z" creationid="Myriam.Legault">
<prop type="client"> </prop>
<prop type="project"> </prop>
<prop type="domain"> </prop>
<prop type="subject"> </prop>
<prop type="corrected">no</prop>
<prop type="aligned">no</prop>
<prop type="x-document">Y:CHU ST Justine Centre de formulations pédiatriques51713 ST Justine51713-2017 06 29 Non Confid GPFC_ Annual Report.docx</prop>
<tuv lang="en">
<prop type="x-context-post"><seg>2016-2017 Annual Report</seg></prop>
<seg><bpt i="1" type="bold">{}</bpt><ept i="1">{}</ept></seg>
</tuv>
<tuv lang="fr">
<seg><bpt i="1" type="bold">{}</bpt><ept i="1">{}</ept></seg>
</tuv>
</tu>
</tmx>
</code>
My second output file looks like this:
<code><header>
<prop></prop>
<prop></prop>
<prop></prop>
<prop></prop>
<prop></prop>
<prop></prop>
<prop></prop>
<prop></prop>
</header>
</code>
I need to be able to convert the tags from the original file (keeping some and modifying others) and write them to a new XML file.
I’m using the following:
Java version = 1.8
Maven version = 3.8.1
Databricks spark-xml: artifactId=spark-xml_2.11, version=0.10.0
Apache Spark: artifactId=spark-core_2.11, version=2.4.8
Spark SQL: artifactId=spark-xml_2.11, version=0.10.0