Using PySpark 3.3 (which has no <code>regexp_extract_all</code>), I’d like to take a column of values:
<code>+-------------------------+
| value |
+-------------------------+
| MONOCYTES 1511|A5905.5 |
+-------------------------+
</code>
The data format is not constant; e.g. the value could also be <code>1511;MONO->A5905.5</code>. I’d like to extract all parts that match the regex <code>r'\w?\d+\.?\d*'</code> (the backslashes here matter: an optional word character, digits, an optional dot, more digits). Then, I’d like to replace each extracted code with the value from another dataframe:
<code>+-----------+--------------+
| code | value |
+-----------+--------------+
| 1511 | monocytes1 |
+-----------+--------------+
| A5905.5 | monocytes2 |
+-----------+--------------+
</code>
Such that I can somehow map {"MONOCYTES 1511|A5905.5": ["monocytes1", "monocytes2"]}
What is the fastest way to do this given the version constraint?
Thanks 🙂
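For reference, this is the plain-Python logic I’m trying to reproduce in Spark. The dict stands in for the code-to-value lookup dataframe; the names here are illustrative, not my actual schema:

```python
import re

# The intended pattern, with backslashes intact: an optional leading
# word character, digits, an optional dot, and trailing digits.
pattern = re.compile(r'\w?\d+\.?\d*')

# Stand-in for the second dataframe (code -> value).
lookup = {"1511": "monocytes1", "A5905.5": "monocytes2"}

def map_codes(value):
    # Extract every code-like token and translate it via the lookup,
    # keeping the raw token if there is no matching entry.
    return [lookup.get(code, code) for code in pattern.findall(value)]

print(map_codes("MONOCYTES 1511|A5905.5"))  # ['monocytes1', 'monocytes2']
print(map_codes("1511;MONO->A5905.5"))      # ['monocytes1', 'monocytes2']
```

Both sample formats yield the same two codes, so the mapping I’m after is format-independent.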