I am trying to read a source file which has a column called Description. The delimiter is ‘,’. This is column can have special character, string, numbers.
The source data is like:
Id | Description | No | Amount |
---|---|---|---|
1 | AbcQ AB2480 BCD abcdefgh 24″ ABC,abcdefgh abcde | null | 120 |
2 | null | 320 |
This data is present in gzip format. I am reading this in spark.
I want the description to be as is. Even if they can’t be as is, I don’t want them to right shift or left shift so it does not disturb the ‘Amount’ column.
I am reading it with code:
inputFileDf1 = (spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "false")
.option("quote",'"')
.option("multiLine","true")
.option("recursiveFileLookup", "true")
.option("compression", "gzip")
.load(path_dm))
Now this code works fine for 1st row ( Id = 1), but for 2nd row I get “,” in description column instead of . Lets say there is another column after amount, the value left shifts and gives incorrect results. 320 which was the amount is now coming in No and so on and now the amount column is messed up for this row.
Id | Description | No | Amount |
---|---|---|---|
1 | AbcQ AB2480 BCD abcdefgh 24″ ABC,abcdefgh abcde | null | 120 |
2 | “,” | 320 | 45 |
To fix the second row, I use
inputFileDf1 = (spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "false")
.option("quote",'"')
.option("escape","")
.option("multiLine","true")
.option("recursiveFileLookup", "true")
.option("compression", "gzip")
.load(path_dm))
This fixes the second row but disturbs the first row.
Id | Description | No | Amount |
---|---|---|---|
1 | “AbcQ AB2480 BCD abcdefgh 24” ABC | abcdefgh abcde” | null |
2 | null | 320 |
Now with this, second row is right but column are shifted for the first row because there was a , present and without escape it splits the description into next column and the amount column is messed up for this row.
Some Constraints:
-
I cannot change the source file – One way would be to change and replace character in source file itself but I cannot do that, I only want to do it while reading
-
I cannot change the delimiter at this point.
What is the best way to resolve this?
Any help is greatly appreciated, thanks!