Hello Stack Overflow community,
I am currently facing an issue in an Azure Synapse Analytics Spark notebook that I do not understand and need help with (note: the Apache Spark pool runs Spark 3.3 with Python 3.10).
The issue:
When I run the following code in a Spark notebook code cell as pure SQL (with the magic command “%%sql” on the first line), the results are as follows:
%%sql
SELECT SHA2('André', 256) AS `Test1a`;
*Result --> 7d918307f08192d8259055ff77363a59bc388762dd613bd5905aa1c1f76b9ae7*
SELECT CAST(SHA2('André', 256) AS BINARY) AS `Test2a`;
*Result (shortened): "[55,100,57,49,56,51,48,55,102,48,56,49,57,50,100,5..."*
SELECT ENCODE(SHA2('André', 256), 'UTF-8') AS `Test3a`;
*Result (shortened): "[55,100,57,49,56,51,48,55,102,48,56,49,57,50,100,5..."*
Note that test cases “Test2a” and “Test3a” return values of data type BINARY.
When I run the same code in a PySpark code cell (with the magic command “%%pyspark” on the first line), the results for test cases “Test2b” and “Test3b” differ from their counterparts above:
%%pyspark
spark.sql('''SELECT SHA2('André', 256) AS `Test1b`;''').show(truncate=False)
*Result --> 7d918307f08192d8259055ff77363a59bc388762dd613bd5905aa1c1f76b9ae7*
spark.sql('''SELECT CAST(SHA2('André', 256) AS BINARY) AS `Test2b`;''').show(truncate=False)
*Result --> [37 64 39 31 38 33 30 37 66 30 38 31 39 32 64 38 32 35 39 30 35 35 66 66 37 37 33 36 33 61 35 39 62 63 33 38 38 37 36 32 64 64 36 31 33 62 64 35 39 30 35 61 61 31 63 31 66 37 36 62 39 61 65 37]*
spark.sql('''SELECT ENCODE(SHA2('André', 256), 'UTF-8') AS `Test3b`;''').show(truncate=False)
*Result --> [37 64 39 31 38 33 30 37 66 30 38 31 39 32 64 38 32 35 39 30 35 35 66 66 37 37 33 36 33 61 35 39 62 63 33 38 38 37 36 32 64 64 36 31 33 62 64 35 39 30 35 61 61 31 63 31 66 37 36 62 39 61 65 37]*
I was expecting test cases “Test2a” and “Test2b” to produce identical outputs.
The same goes for test cases “Test3a” and “Test3b”, but that is evidently not the case.
I want to understand why the results differ and how to control this so that the results are identical, regardless of whether the magic command is %%sql or %%pyspark.
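One observation that may help narrow this down: the two outputs might be the same bytes shown in different number bases (decimal in the %%sql grid, hexadecimal in show()). Below is a minimal sketch in plain Python, without Spark, that takes the hex digest from Test1a/Test1b and prints its bytes both ways. It assumes that CAST(... AS BINARY) and ENCODE(..., 'UTF-8') simply return the UTF-8 bytes of the digest string; the variable names are my own and only for illustration.

```python
# Hex digest copied from the Test1a / Test1b result above.
hex_digest = "7d918307f08192d8259055ff77363a59bc388762dd613bd5905aa1c1f76b9ae7"

# Assumption: the BINARY value is just the UTF-8 bytes of the digest string.
data = hex_digest.encode("utf-8")

# Render every byte in decimal (similar to the %%sql output) ...
decimal_view = ",".join(str(b) for b in data)
# ... and in two-digit hexadecimal (similar to the show() output).
hex_view = " ".join(f"{b:02x}" for b in data)

print("decimal:", decimal_view)
print("hex:    ", hex_view)
```

If both printed lines match the results above, the underlying BINARY values would be identical and only the display would differ, so the remaining question is how to control that rendering.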
Best regards,
MJ
P.S.: It is my first post on StackOverflow in years, so please be constructive with your criticism regarding this post. Thx.