In Spark SQL, comparing a string column to an integer literal can produce unexpected results: Spark implicitly casts the string column to `int`, non-numeric values become `NULL`, and the filter then drops every row, leaving an empty result set.
Code Run in Spark-SQL
case class Person(id: String, name: String)
val personDF = Seq(Person("a", "amit"), Person("b", "abhishek")).toDF()
personDF.createOrReplaceTempView("person_ddf")
val sqlQuery = "SELECT * FROM person_ddf WHERE id <> -1"
val resultDF = spark.sql(sqlQuery)
resultDF.show() // Empty result due to type casting issue
Result
| id | name |
|----|------|
Explain Plan
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT ('id = -1)
+- 'UnresolvedRelation [person_ddf], [], false
== Analyzed Logical Plan ==
id: string, name: string
Project [id#356, name#357]
+- Filter NOT (cast(id#356 as int) = -1)
+- SubqueryAlias person_ddf
+- View (`person_ddf`, [id#356,name#357])
+- LocalRelation [id#356, name#357]
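The key line is `Filter NOT (cast(id#356 as int) = -1)`: with ANSI mode off (the default before Spark 3.x changed the flag's default), casting a non-numeric string to `int` yields `NULL`, and `NULL <> -1` evaluates to `NULL`, which the filter treats as not-true. A minimal plain-Scala sketch of that semantics, using `toIntOption` as a stand-in for Spark's lenient cast (the names here are illustrative, not Spark APIs):

```scala
// Stand-in for Spark's lenient CAST(string AS INT):
// a non-numeric string casts to NULL (modeled here as None).
def castToInt(s: String): Option[Int] = s.toIntOption

val ids = Seq("a", "b")

// Filter NOT (cast(id as int) = -1): a comparison against NULL is
// not TRUE, so the row is dropped -- mirroring the empty Spark result.
val kept = ids.filter(id => castToInt(id).exists(_ != -1))
// kept is empty: both casts produce None, so no row passes the filter
```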
But when I run the same query against the same table in Redshift (which is based on PostgreSQL), I get the desired result.
select * from person where id <> -1;
Result
| id | name     |
|----|----------|
| a  | amit     |
| b  | abhishek |
Explain Plan
XN Seq Scan on person (cost=0.00..0.03 rows=1 width=336)
Filter: ((id)::text <> '-1'::text)
Does someone have an explanation for this? Shouldn't Spark SQL work the same way as Redshift?
In the Spark plan, the string column `id` is cast to `int`, whereas in the Redshift plan both `id` and the literal `-1` are cast to `text`, so the comparison stays string-to-string.
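One way to keep the comparison in the string domain in Spark (which is effectively what Redshift's `(id)::text <> '-1'::text` does) is to quote the literal, e.g. `WHERE id <> '-1'`, or cast explicitly. A plain-Scala sketch of the string-to-string comparison over the same two rows as above (no Spark involved; it only illustrates the predicate):

```scala
// Stand-in data matching personDF above.
case class Person(id: String, name: String)
val people = Seq(Person("a", "amit"), Person("b", "abhishek"))

// String-to-string comparison, as in WHERE id <> '-1':
// no cast to int, so non-numeric ids are compared as ordinary strings.
val kept = people.filter(_.id != "-1")
// Both rows survive, matching the Redshift result.
```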