I am reading from a PostgreSQL table with Spark JDBC using 10 parallel connections, with RNO as the partition column and bounds ranging from 0 to the number of rows in the table.
The table is dynamic: new rows are inserted every second.
As a first step I count the rows, and then use that value as the upperBound.
The issue I am facing is that the result always contains duplicate rows, or is missing rows that I expect to see but cannot find.
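For context, calcCountRows just fetches the current row count over JDBC. The exact helper is not shown here, but a minimal sketch, assuming a plain SELECT COUNT(*) through the same connection options, looks roughly like this:

// Hypothetical sketch of the counting helper (not the exact implementation):
// runs SELECT COUNT(*) against the same table through the same JDBC options.
def calcCountRows(table: String): Long =
  spark.read
    .format("jdbc")
    .options(database.connectionDetails())
    .option("dbtable", s"(select count(*) as cnt from $table) as counts")
    .load()
    .head()
    .getLong(0)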
The read itself looks like this:

val rows = calcCountRows(table)                 // snapshot of the row count at kickoff
val numPartitions = math.min(partitions, rows)

val df = spark.read
  .format("jdbc")
  .options(database.connectionDetails())
  .option("fetchsize", 100)
  .option("dbtable",
    s"""
       |(select ROW_NUMBER() OVER(ORDER BY NULL) AS RNO, subQuery.* from $table subQuery) as "$table"
       |""".stripMargin)                        // row numbers are assigned in no particular order
  .option("partitionColumn", "RNO")
  .option("lowerBound", 0)                      // bounds only set the partition stride; they do not filter rows
  .option("upperBound", rows)
  .option("numPartitions", numPartitions)
  .load()
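As far as I understand, Spark turns this into numPartitions parallel queries, each re-running the ROW_NUMBER() subquery with its own WHERE clause on RNO, roughly like this (stride = (upperBound - lowerBound) / numPartitions):

-- approximate per-partition queries Spark issues; each executes the subquery independently
SELECT * FROM (...) WHERE RNO < stride OR RNO IS NULL          -- first partition
SELECT * FROM (...) WHERE RNO >= stride AND RNO < 2 * stride   -- middle partitions
SELECT * FROM (...) WHERE RNO >= (numPartitions - 1) * stride  -- last partition, unbounded above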
When I fetch data from the resulting DataFrame, I sometimes see duplicate entries, and if I read it again I see a different set of duplicates, or entries that are missing entirely.
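For example, this is roughly how I check (assuming the table has a primary-key column; the name id here is just for illustration):

import org.apache.spark.sql.functions.col

// Hypothetical duplicate check against an assumed primary-key column "id"
df.groupBy("id")
  .count()
  .filter(col("count") > 1)
  .show()

// Even the total count drifts between two actions on the same DataFrame
println(df.count())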
How can I handle this issue?