I have text in a pyspark column called TEXT
that look like below:
The sky is red. I have 2 apples and I am fine.
----------------------------------------------
The sky is back. I have 8 apples or I am fine.
----------------------------------------------
The sky is back. she has 8 apples and I am fine.
----------------------------------------------
The hill is red. I have 3 apples and I am fine
.
.
.
I want to create Regx pattern and use F.regexp_extract
function to extract the number of apples in this column. However, I am restricted in several ways:
- Only interested in sentences that start with
The sky
- Only interested in the number of apples after
I have
- Only interested in the number of apples that continues by
and
.
Thus, I would like my Regx pattern to only extract the number of apples from the first sentence (i.e., 2 apples)
This is the Regx pattern that I came up with: regex_pattern = (?<=The sky.*?)(?<=I have )(.*?)(?= apples and)
It perfectly works via this website http://regexstorm.net/tester.
However, when I use in my Pyspark database as:
df.withColumn('Number_Apples', F.regexp_extract(F.col("TEXT"), regex_pattern, 0))
It raise the following Error:
SparkRuntimeException: [INVALID_PARAMETER_VALUE] The value of parameter(s) 'regexp' in regexp_extract is invalid
Can anyone let me know please how can I modify the Regx pattern to compile with Pyspark functionality?
This is my code example which raise the Error:
import pyspark.sql.functions as F
from pyspark.sql import DataFrame as SparkDataFrame
regex_pattern = '(?<=The sky.*?)(?<=I have )(.*?)(?= apples and)'
df = spark.createDataFrame(
[
('The sky is red. I have 2 apples and I am fine.', 2),
('The sky is back. I have 8 apples or I am fine.', 8),
('The sky is back. she has 8 apples and I am fine.', 8),
('The hill is red. I have 3 apples and I am fine', 3),
],
["TEXT", "APPLES"] # add your column names here
)
df = df.withColumn('Number_Apples', F.regexp_extract(F.col("TEXT"), regex_pattern, 0))
df.display()