I know I'm asking a very basic question here, but is there any way to replace only the first occurrence of a character within a PySpark DataFrame?
I have the below value in a DataFrame column.
Gourav#Joshi#Karnataka#US#English
I only want to replace the first occurrence of # in the value.
Expected Output:
Gourav Joshi#Karnataka#US#English
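For reference, a minimal reproducible DataFrame with this value could be built like below (a sketch; the column name col is just an assumption, since the question doesn't name it):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# one row, one string column named 'col' (assumed name)
df = spark.createDataFrame([("Gourav#Joshi#Karnataka#US#English",)], ["col"])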
You can use the split function with a limit of 2 to break the string into two parts at the first #, and then use the array_join function to join them back together with a space.
import pyspark.sql.functions as F
...
# split at most once (limit=2) so only the first '#' is affected,
# then rejoin the two parts with a space; the limit argument requires Spark 3.0+
df = df.select(F.array_join(F.split('col', '#', 2), ' '))
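A quick end-to-end check of this approach, starting again from the sample df defined under the question (a sketch, not from the original answer):
import pyspark.sql.functions as F

df.select(
    F.array_join(F.split("col", "#", 2), " ").alias("col_new")
).show(truncate=False)
# +---------------------------------+
# |col_new                          |
# +---------------------------------+
# |Gourav Joshi#Karnataka#US#English|
# +---------------------------------+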
Just use regexp_replace and capture the sub-string before the 1st # as $1:
spark.sql("""
select col, regexp_replace(col,'^([^#]*)#','$1 ') col_new
from values ('Gourav#Joshi#Karnataka#US#English') as (col)
""").show(1,0)
+---------------------------------+---------------------------------+
|col |col_new |
+---------------------------------+---------------------------------+
|Gourav#Joshi#Karnataka#US#English|Gourav Joshi#Karnataka#US#English|
+---------------------------------+---------------------------------+
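The same regex also works through the DataFrame API instead of spark.sql (a sketch, assuming a DataFrame df with the column col as above):
import pyspark.sql.functions as F

# group 1 captures everything before the first '#'; '$1 ' writes the captured
# prefix back followed by a space in place of that '#'
df = df.withColumn("col_new", F.regexp_replace("col", r"^([^#]*)#", "$1 "))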
Try this:
from pyspark.sql.functions import split, col, slice, size, concat_ws

ab = ab.withColumn("string1", split(col("string"), "#").getItem(0))
ab = (ab.withColumn("new", split(col("string"), "#"))
        .withColumn("ok", slice(col("new"), 2, size(col("new")))))
ab = ab.withColumn("string2", concat_ws("#", col("ok"))).drop("ok", "new")
# join the two parts with a space to get the expected output
ab = ab.withColumn("result", concat_ws(" ", col("string1"), col("string2")))
ab.show(truncate=False)
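For the sample value, selecting only the derived columns, the output would look like this (assuming the input column is named string, as in the snippet):
ab.select("string1", "string2", "result").show(truncate=False)
+-------+--------------------------+---------------------------------+
|string1|string2                   |result                           |
+-------+--------------------------+---------------------------------+
|Gourav |Joshi#Karnataka#US#English|Gourav Joshi#Karnataka#US#English|
+-------+--------------------------+---------------------------------+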
I think the most feasible solution here is regexp_replace. Note that PySpark's regexp_replace replaces every match and has no occurrence argument, so anchor the pattern so it can only match the first #:
from pyspark.sql.functions import regexp_replace, col

# '^([^#]*)#' matches up to and including the first '#' only;
# '$1 ' keeps the captured prefix and swaps that '#' for a space
df = df.withColumn("new_column", regexp_replace(col("initialcol"), r"^([^#]*)#", "$1 "))
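As a sanity check on the sample value (a sketch; initialcol is the column name this answer assumes):
df.select("new_column").show(truncate=False)
+---------------------------------+
|new_column                       |
+---------------------------------+
|Gourav Joshi#Karnataka#US#English|
+---------------------------------+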