What is the best approach to parse a large, complex XML file (1.5 GB) and save it into a database using Spark with Scala? Below is a sample of the XML for reference.
<cardataresponse>
<responsetype>Success</responsetype>
<requestid>12345</requestid>
<requestsystem>CarDataSystem</requestsystem>
<requestuser>JohnDoe</requestuser>
<responseattribute>
<runid>001</runid>
<runtime>2024-07-04T12:00:00Z</runtime>
<userid>john_doe</userid>
<date>2024-07-04</date>
<status>Completed</status>
<scenariodatalist>
<scenariodata>
<sname>Car Information</sname>
<sid>001A</sid>
<sversion>1.0</sversion>
<sstatus>Active</sstatus>
<sdatefrom>2024-07-01</sdatefrom>
<variablelist>
<variable>
<identifier>
<id>car001</id>
<version>1</version>
</identifier>
<variabletype>Sedan</variabletype>
<variablevalue>Available</variablevalue>
<displayvariablevale>Yes</displayvariablevale>
<displaytype>Text</displaytype>
<unit>None</unit>
<attributes>
<color>
<type>Exterior</type>
<value>Red</value>
<description>Bright Red Exterior</description>
</color>
<color>
<type>Interior</type>
<value>Black</value>
<description>Black Leather Interior</description>
</color>
<color>
<type>Exterior</type>
<value>Blue</value>
<description>Metallic Blue Exterior</description>
</color>
<color>
<type>Interior</type>
<value>White</value>
<description>White Fabric Interior</description>
</color>
<color>
<type>Exterior</type>
<value>Green</value>
<description>Forest Green Exterior</description>
</color>
<color>
<type>Interior</type>
<value>Beige</value>
<description>Beige Leather Interior</description>
</color>
<color>
<type>Exterior</type>
<value>Black</value>
<description>Glossy Black Exterior</description>
</color>
</attributes>
<dataentry>
<vector>
<entry>
<identifier>
<id>1</id>
<version>1.0</version>
</identifier>
<data>
<originaldatavalue>1000</originaldatavalue>
<absoluteamount>1500</absoluteamount>
<relativeamount>1.5</relativeamount>
<floorlevel>100</floorlevel>
<caplevel>2000</caplevel>
<rawdata>1000</rawdata>
<finaldatavalue>1800</finaldatavalue>
<capapplied>true</capapplied>
</data>
<dimension>
<xaxisvalue>10</xaxisvalue>
<xaxisposition>20</xaxisposition>
<xaxistype>linear</xaxistype>
<type>dimensionType1</type>
<xaxisconvertvalue>1.1</xaxisconvertvalue>
</dimension>
</entry>
<entry>
<identifier>
<id>2</id>
<version>1.1</version>
</identifier>
<data>
<originaldatavalue>2000</originaldatavalue>
<absoluteamount>2500</absoluteamount>
<relativeamount>1.25</relativeamount>
<floorlevel>200</floorlevel>
<caplevel>3000</caplevel>
<rawdata>2000</rawdata>
<finaldatavalue>2800</finaldatavalue>
<capapplied>false</capapplied>
</data>
<dimension>
<xaxisvalue>15</xaxisvalue>
<xaxisposition>25</xaxisposition>
<xaxistype>logarithmic</xaxistype>
<type>dimensionType2</type>
<xaxisconvertvalue>1.2</xaxisconvertvalue>
</dimension>
</entry>
</vector>
</dataentry>
</variable>
<variable>
<identifier>
<id>car002</id>
<version>1.1</version>
</identifier>
<variabletype>Fuel Efficiency</variabletype>
<variablevalue>15</variablevalue>
<displayvariablevale>15 km/l</displayvariablevale>
<displaytype>Analog</displaytype>
<unit>km/l</unit>
<attributes>
<color>
<type>Primary</type>
<value>Blue</value>
<description>Main color</description>
</color>
<color>
<type>Secondary</type>
<value>Silver</value>
<description>Trim color</description>
</color>
<color>
<type>Interior</type>
<value>Gray</value>
<description>Seat color</description>
</color>
<color>
<type>Accent</type>
<value>Black</value>
<description>Exterior accents</description>
</color>
</attributes>
<dataentry>
<vector>
<entry>
<identifier>
<id>3</id>
<version>1.2</version>
</identifier>
<data>
<originaldatavalue>3000</originaldatavalue>
<absoluteamount>3500</absoluteamount>
<relativeamount>1.75</relativeamount>
<floorlevel>300</floorlevel>
<caplevel>4000</caplevel>
<rawdata>3000</rawdata>
<finaldatavalue>3800</finaldatavalue>
<capapplied>true</capapplied>
</data>
<dimension>
<xaxisvalue>20</xaxisvalue>
<xaxisposition>30</xaxisposition>
<xaxistype>exponential</xaxistype>
<type>dimensionType3</type>
<xaxisconvertvalue>1.3</xaxisconvertvalue>
</dimension>
</entry>
<entry>
<identifier>
<id>4</id>
<version>1.3</version>
</identifier>
<data>
<originaldatavalue>4000</originaldatavalue>
<absoluteamount>4500</absoluteamount>
<relativeamount>2.0</relativeamount>
<floorlevel>400</floorlevel>
<caplevel>5000</caplevel>
<rawdata>4000</rawdata>
<finaldatavalue>4800</finaldatavalue>
<capapplied>false</capapplied>
</data>
<dimension>
<xaxisvalue>25</xaxisvalue>
<xaxisposition>35</xaxisposition>
<xaxistype>polynomial</xaxistype>
<type>dimensionType4</type>
<xaxisconvertvalue>1.4</xaxisconvertvalue>
</dimension>
</entry>
</vector>
</dataentry>
</variable>
</variablelist>
</scenariodata>
<scenariodata>
<sname>Truck Information</sname>
<sid>002B</sid>
<sversion>2.0</sversion>
<sstatus>Inactive</sstatus>
<sdatefrom>2024-08-01</sdatefrom>
<variablelist>
<variable>
<identifier>
<id>truck001</id>
<version>2</version>
</identifier>
<variabletype>Pickup</variabletype>
<variablevalue>Available</variablevalue>
<displayvariablevale>Yes</displayvariablevale>
<displaytype>Text</displaytype>
<unit>None</unit>
<attributes>
<color>
<type>Exterior</type>
<value>White</value>
<description>Bright White Exterior</description>
</color>
<color>
<type>Interior</type>
<value>Black</value>
<description>Black Leather Interior</description>
</color>
<color>
<type>Exterior</type>
<value>Silver</value>
<description>Metallic Silver Exterior</description>
</color>
<color>
<type>Interior</type>
<value>Brown</value>
<description>Brown Fabric Interior</description>
</color>
<color>
<type>Exterior</type>
<value>Blue</value>
<description>Navy Blue Exterior</description>
</color>
<color>
<type>Interior</type>
<value>Gray</value>
<description>Gray Leather Interior</description>
</color>
<color>
<type>Exterior</type>
<value>Black</value>
<description>Matte Black Exterior</description>
</color>
</attributes>
<dataentry>
<vector>
<entry>
<identifier>
<id>5</id>
<version>2.0</version>
</identifier>
<data>
<originaldatavalue>5000</originaldatavalue>
<absoluteamount>5500</absoluteamount>
<relativeamount>1.1</relativeamount>
<floorlevel>500</floorlevel>
<caplevel>6000</caplevel>
<rawdata>5000</rawdata>
<finaldatavalue>5800</finaldatavalue>
<capapplied>true</capapplied>
</data>
<dimension>
<xaxisvalue>30</xaxisvalue>
<xaxisposition>40</xaxisposition>
<xaxistype>linear</xaxistype>
<type>dimensionType5</type>
<xaxisconvertvalue>1.5</xaxisconvertvalue>
</dimension>
</entry>
<entry>
<identifier>
<id>6</id>
<version>2.1</version>
</identifier>
<data>
<originaldatavalue>6000</originaldatavalue>
<absoluteamount>6500</absoluteamount>
<relativeamount>1.08</relativeamount>
<floorlevel>600</floorlevel>
<caplevel>7000</caplevel>
<rawdata>6000</rawdata>
<finaldatavalue>6800</finaldatavalue>
<capapplied>false</capapplied>
</data>
<dimension>
<xaxisvalue>35</xaxisvalue>
<xaxisposition>45</xaxisposition>
<xaxistype>exponential</xaxistype>
<type>dimensionType6</type>
<xaxisconvertvalue>1.6</xaxisconvertvalue>
</dimension>
</entry>
</vector>
</dataentry>
</variable>
<variable>
<identifier>
<id>truck002</id>
<version>2.1</version>
</identifier>
<variabletype>Payload Capacity</variabletype>
<variablevalue>2000</variablevalue>
<displayvariablevale>2000 kg</displayvariablevale>
<displaytype>Digital</displaytype>
<unit>kg</unit>
<attributes>
<color>
<type>Primary</type>
<value>Yellow</value>
<description>Main color</description>
</color>
<color>
<type>Secondary</type>
<value>Gray</value>
<description>Trim color</description>
</color>
<color>
<type>Interior</type>
<value>Black</value>
<description>Seat color</description>
</color>
<color>
<type>Accent</type>
<value>White</value>
<description>Exterior accents</description>
</color>
</attributes>
<dataentry>
<vector>
<entry>
<identifier>
<id>7</id>
<version>2.2</version>
</identifier>
<data>
<originaldatavalue>7000</originaldatavalue>
<absoluteamount>7500</absoluteamount>
<relativeamount>1.07</relativeamount>
<floorlevel>700</floorlevel>
<caplevel>8000</caplevel>
<rawdata>7000</rawdata>
<finaldatavalue>7800</finaldatavalue>
<capapplied>true</capapplied>
</data>
<dimension>
<xaxisvalue>40</xaxisvalue>
<xaxisposition>50</xaxisposition>
<xaxistype>logarithmic</xaxistype>
<type>dimensionType7</type>
<xaxisconvertvalue>1.7</xaxisconvertvalue>
</dimension>
</entry>
<entry>
<identifier>
<id>8</id>
<version>2.3</version>
</identifier>
<data>
<originaldatavalue>8000</originaldatavalue>
<absoluteamount>8500</absoluteamount>
<relativeamount>1.06</relativeamount>
<floorlevel>800</floorlevel>
<caplevel>9000</caplevel>
<rawdata>8000</rawdata>
<finaldatavalue>8800</finaldatavalue>
<capapplied>false</capapplied>
</data>
<dimension>
<xaxisvalue>45</xaxisvalue>
<xaxisposition>55</xaxisposition>
<xaxistype>polynomial</xaxistype>
<type>dimensionType8</type>
<xaxisconvertvalue>1.8</xaxisconvertvalue>
</dimension>
</entry>
</vector>
</dataentry>
</variable>
</variablelist>
</scenariodata>
</scenariodatalist>
<mappinglist>
<countrylist>
<country>
<finalcurrencycode>USD</finalcurrencycode>
<isocurrencycode>USD</isocurrencycode>
<finalcurrencydescription>US Dollar</finalcurrencydescription>
<finalcountrycode>US</finalcountrycode>
<isocountrycode>US</isocountrycode>
<finalcountrydescription>United States</finalcountrydescription>
<market>North America</market>
<fxvol>0.25</fxvol>
<irvol>0.15</irvol>
<eqvol>0.30</eqvol>
<paaregion>NA</paaregion>
<countryname>USA</countryname>
</country>
<country>
<finalcurrencycode>EUR</finalcurrencycode>
<isocurrencycode>EUR</isocurrencycode>
<finalcurrencydescription>Euro</finalcurrencydescription>
<finalcountrycode>DE</finalcountrycode>
<isocountrycode>DE</isocountrycode>
<finalcountrydescription>Germany</finalcountrydescription>
<market>Europe</market>
<fxvol>0.20</fxvol>
<irvol>0.12</irvol>
<eqvol>0.25</eqvol>
<paaregion>EU</paaregion>
<countryname>Germany</countryname>
</country>
<country>
<finalcurrencycode>JPY</finalcurrencycode>
<isocurrencycode>JPY</isocurrencycode>
<finalcurrencydescription>Japanese Yen</finalcurrencydescription>
<finalcountrycode>JP</finalcountrycode>
<isocountrycode>JP</isocountrycode>
<finalcountrydescription>Japan</finalcountrydescription>
<market>Asia</market>
<fxvol>0.18</fxvol>
<irvol>0.10</irvol>
<eqvol>0.22</eqvol>
<paaregion>APAC</paaregion>
<countryname>Japan</countryname>
</country>
<country>
<finalcurrencycode>GBP</finalcurrencycode>
<isocurrencycode>GBP</isocurrencycode>
<finalcurrencydescription>British Pound</finalcurrencydescription>
<finalcountrycode>GB</finalcountrycode>
<isocountrycode>GB</isocountrycode>
<finalcountrydescription>United Kingdom</finalcountrydescription>
<market>Europe</market>
<fxvol>0.22</fxvol>
<irvol>0.14</irvol>
<eqvol>0.28</eqvol>
<paaregion>EU</paaregion>
<countryname>United Kingdom</countryname>
</country>
<country>
<finalcurrencycode>AUD</finalcurrencycode>
<isocurrencycode>AUD</isocurrencycode>
<finalcurrencydescription>Australian Dollar</finalcurrencydescription>
<finalcountrycode>AU</finalcountrycode>
<isocountrycode>AU</isocountrycode>
<finalcountrydescription>Australia</finalcountrydescription>
<market>Oceania</market>
<fxvol>0.23</fxvol>
<irvol>0.13</irvol>
<eqvol>0.26</eqvol>
<paaregion>APAC</paaregion>
<countryname>Australia</countryname>
</country>
</countrylist>
<riskmappinglist>
<riskmapping>
<riskid>R1</riskid>
<category>Financial Risk</category>
<l1desc>Market Risk</l1desc>
<l2desc>Interest Rate Risk</l2desc>
<l3desc>Yield Curve Risk</l3desc>
<country1>USA</country1>
<country2>Germany</country2>
<type>Systematic Risk</type>
</riskmapping>
<riskmapping>
<riskid>R2</riskid>
<category>Operational Risk</category>
<l1desc>Internal Process Risk</l1desc>
<l2desc>Technology Risk</l2desc>
<l3desc>Data Security Risk</l3desc>
<country1>UK</country1>
<country2>France</country2>
<type>Specific Risk</type>
</riskmapping>
<riskmapping>
<riskid>R3</riskid>
<category>Compliance Risk</category>
<l1desc>Regulatory Risk</l1desc>
<l2desc>Legal Risk</l2desc>
<l3desc>Anti-Money Laundering Risk</l3desc>
<country1>Japan</country1>
<country2>Australia</country2>
<type>Legal Risk</type>
</riskmapping>
<riskmapping>
<riskid>R4</riskid>
<category>Strategic Risk</category>
<l1desc>Business Strategy Risk</l1desc>
<l2desc>Market Competition Risk</l2desc>
<l3desc>Product Development Risk</l3desc>
<country1>Canada</country1>
<country2>India</country2>
<type>Strategic Risk</type>
</riskmapping>
</riskmappinglist>
</mappinglist>
</responseattribute>
</cardataresponse>
I used spark-xml with a recursive approach to generate database tables from a 30 MB XML file. However, when I increased the file size to 1.5 GB, the containers were killed by an external signal and the job terminated with a 'Failed' status.
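
For comparison, here is a minimal sketch of one non-recursive way to read this structure with spark-xml. It is not the code that failed, and it has not been run against the 1.5 GB file. It assumes spark-xml is on the classpath and that the version in use tells `<scenariodata>` apart from `<scenariodatalist>` when matching `rowTag`. The object name `FlattenCarData`, the output column names, the choice of columns, and the JDBC URL and table arguments are all hypothetical. The idea is to let `rowTag` split the file into one row per `<scenariodata>`, then flatten the nested lists with `explode` instead of walking the tree recursively.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object FlattenCarData {

  /** Returns one row per <entry>, carrying the ids of its parent scenario and variable. */
  def flattenEntries(spark: SparkSession, path: String): DataFrame = {
    // Each <scenariodata> element becomes one row; repeated child elements
    // (<variable>, <entry>) are inferred as arrays of structs.
    val scenarios = spark.read
      .format("xml")
      .option("rowTag", "scenariodata")
      .load(path)

    scenarios
      // First level: one row per <variable> within a scenario.
      .select(col("sid"), explode(col("variablelist.variable")).as("variable"))
      // Second level: one row per <entry> within a variable.
      .select(
        col("sid"),
        col("variable.identifier.id").as("variable_id"),
        explode(col("variable.dataentry.vector.entry")).as("entry"))
      // Keep a few leaf fields; the selection here is only illustrative.
      .select(
        col("sid"),
        col("variable_id"),
        col("entry.identifier.id").as("entry_id"),
        col("entry.data.finaldatavalue").as("finaldatavalue"),
        col("entry.dimension.xaxistype").as("xaxistype"))
  }

  /** Appends the flattened rows to a table over JDBC. */
  def saveToDb(df: DataFrame, url: String, table: String, props: Properties): Unit =
    df.write.mode(SaveMode.Append).jdbc(url, table, props)

  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: <xml path> <jdbc url> <target table>
    val Array(path, url, table) = args.take(3)
    val spark = SparkSession.builder().appName("FlattenCarData").getOrCreate()
    saveToDb(flattenEntries(spark, path), url, table, new Properties())
    spark.stop()
  }
}
```

The `<countrylist>` and `<riskmappinglist>` sections could be loaded the same way with separate reads using `rowTag` values of `country` and `riskmapping`. Whether this stays within container memory at 1.5 GB depends on how large a single `<scenariodata>` element is, since each one has to fit in a single row.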