I have sample data in the following string format in a Hive table:
+-----------------------+
| col1                  |
+-----------------------+
| 160-80-40 sec         |
| 160-80-40 sec         |
| 10-10-10-20-20-30 min |
| 10-10-10-20-20-30 min |
| 10-20-30-40-50-60 min |
| 200-100-100 sec       |
| 400 200               |
+-----------------------+
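For reproducibility, here is a sketch that builds the same sample data directly in Spark (sampleDf is just a stand-in for the Hive table; the name is mine):

import spark.implicits._

// Stand-in for the Hive table: a single string column with the rows above
val sampleDf = Seq(
  "160-80-40 sec",
  "160-80-40 sec",
  "10-10-10-20-20-30 min",
  "10-10-10-20-20-30 min",
  "10-20-30-40-50-60 min",
  "200-100-100 sec",
  "400 200"
).toDF("col1")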
I need to find the sum, the max value, and the min value of the numbers in each row.

When I load this Hive table in Spark, the column's datatype is inferred as string:
val df = spark.sql("select col1 from table1")
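Checking the schema confirms it (printSchema output, assuming the usual nullable string column):

df.printSchema()
// root
//  |-- col1: string (nullable = true)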
I tried to use split and size on the DataFrame. First I separated the numbers from the unit with split, then I split again on "-" and tried to sum the resulting array:

val df1 = df.withColumn("new_col1", split('col1, " ")(0))
df1.withColumn("new_col2", sum(split('new_col1, "-")))

Since split returns an array of strings, this fails with:

org.apache.spark.sql.AnalysisException: cannot resolve 'sum(split(`new_col1`, '-'))' due to data type mismatch: function sum requires numeric types, not array;;
Is there a way to cast an array<string> to an array<int> in a DataFrame? I would like to solve this problem without using UDFs.
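For illustration, this is the kind of non-UDF approach I am imagining (a sketch I have not verified; the element-wise cast to array<int> and the Spark 2.4+ built-ins aggregate, array_max, and array_min are my assumptions):

import org.apache.spark.sql.functions._

// Sketch: cast each element of the string array to int, then use built-in
// array functions instead of a UDF (assumes Spark 2.4 or later)
val nums = df1.withColumn("ints", split('new_col1, "-").cast("array<int>"))

nums
  .withColumn("total", expr("aggregate(ints, 0, (acc, x) -> acc + x)")) // sum of elements
  .withColumn("max",   array_max('ints))                                // largest element
  .withColumn("min",   array_min('ints))                                // smallest element
  .show(false)

Does something like this work, or is there a better built-in way?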