I am trying to capitalize the first letter of each word and lowercase the rest of the letters of a string in Spark SQL.
spark.sql("select regexp_replace('new york','(\\w)(\\w*)',upper('$1')||lower('$2'))")
Expecting: New York
Output: new york
It is replacing the values but not changing their cases.
If I hardcode the replacement characters, it works:
spark.sql("select regexp_replace('new york','(\\w)(\\w*)',upper('n')||lower('ORK'))")
Output: Nork Nork (works but not the result I need)
I want to approach it using Spark SQL only, not the DataFrame methods.
I know about the initcap function, but it only treats whitespace as a word separator; it does not work with '-' or ',' as the separator.
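For reference, a minimal reproduction of that limitation (the expected values follow initcap's documented whitespace-only word splitting):
spark.sql("select initcap('new york') as a, initcap('new-york') as b").show()
# a = 'New York', but b = 'New-york': the letter after '-' stays lowercase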
In regex terms, \U$1 is what you're looking for, but Spark's regexp_replace() doesn't support the \U case-conversion escape in the replacement string. It works in GNU sed, for example:
$ echo 'new york' | sed 's/\(\b[a-z]\)/\U\1/g'
New York
$
You could do the whole thing in pure SQL, though it gets complicated: split the string into an array of words, recase each word (first character upper, the rest lower), and then join the array back together. A sketch of that is below.
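Something like this, assuming Spark 2.4+ for the transform and array_join functions (the '-' separator is just an example):
# Split on '-', recase each word, join back with the same separator.
spark.sql("""
    select array_join(
        transform(
            split('new-york', '-'),
            w -> concat(upper(substring(w, 1, 1)), lower(substring(w, 2)))
        ),
        '-'
    ) as result
""").show()
# result: New-York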
But IMO it’s infinitely easier to just register a UDF.
import re
from pyspark.sql.types import StringType

# Capitalize each \w+ run: first letter upper, rest lower.
# Matching words with a regex instead of str.split() handles
# any separator (space, '-', ',', ...), not just whitespace.
spark.udf.register(
    "customCaseUDF",
    lambda s: re.sub(r'\w+', lambda m: m.group(0).capitalize(), s),
    StringType()
)
Then you can use it in SQL:
select customCaseUDF('new york')
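It also handles the non-space separators from the question, e.g.:
spark.sql("select customCaseUDF('new-york') as result").show()
# result: New-York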
You can't use SQL functions (upper, lower) inside the replacement argument of regexp_replace. They are evaluated once, before the regex runs: upper('$1') || lower('$2') just produces the literal string '$1$2', which re-inserts the matched groups unchanged. That is why your output is still 'new york'.
Try something like this instead:
spark.sql("
select
concat(
upper(regexp_replace('new york', '(new)', 'N')), -- Replace 'new' with 'N'
' ',
lower(regexp_replace('new york', '(york)', 'ORK')) -- Replace 'york' with 'ORK'
) as result
")
- Converts the first letter of each word to uppercase using upper.
- Converts the remaining part of each word to lowercase using lower.
Note that this hardcodes the character positions for 'new york'; for arbitrary input you would still need the split/transform or UDF approach from the other answer.