I have a PySpark DataFrame with one array column. Each array contains string elements. I need to extract the elements that have a specific length (two characters in the example below).
id | array_with_strings |
---|---|
00001 | [N, NS, NSY, NSB] |
00002 | [B, BS, BSN, BSD] |
I am aiming for this:
id | twoDigits |
---|---|
00001 | NS |
00002 | BS |
I tried:
from pyspark.sql import functions as F

def getTwoDigits(arr):
    for x in arr:
        if F.length(x) == 2:
            return x
        else:
            return None

extractTwoDigits_udf = F.udf(lambda z: getTwoDigits(z))
df = df.withColumn(
    "twoDigits", extractTwoDigits_udf(F.col("array_with_strings"))
)
I am not really sure what I should use in this specific instance.
Thanks in advance for any help or hint.
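For reference, here is a minimal sketch of one way the attempt above could be made to work. It rests on two observations. `F.length` operates on Spark columns, so inside a UDF, where each element is an ordinary Python string, `len()` is the function to use. And the `else: return None` branch exits the loop after the first element, so later elements are never checked. The names `first_with_length` and `add_two_digits` are hypothetical, the column names are taken from the example tables, and returning only the first match is an assumption based on the desired output.

```python
from typing import List, Optional


def first_with_length(arr: Optional[List[str]], n: int = 2) -> Optional[str]:
    """Return the first string in arr whose length is n, or None if there is none."""
    if arr is None:  # a null array arrives in the UDF as None
        return None
    for x in arr:
        # x is a plain Python str here, so use len() rather than F.length()
        if x is not None and len(x) == n:
            return x
    return None  # reached only after every element has been checked


def add_two_digits(df):
    """Add a twoDigits column to df by applying the helper above as a UDF."""
    # Imported here so the helper above can be tried without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    two_digits_udf = F.udf(first_with_length, StringType())
    return df.withColumn("twoDigits", two_digits_udf(F.col("array_with_strings")))
```

Calling `add_two_digits(df)` on the example DataFrame should give `NS` and `BS` for the two rows. On Spark 2.4 or later, a built-in expression such as `F.expr("filter(array_with_strings, x -> length(x) = 2)[0]")` may avoid the UDF altogether. With ANSI mode off, indexing an empty result yields null, but that behaviour is worth checking on the Spark version in use.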