Flag IDs that have a null value ONLY across repeat observations (pandas/pyspark)

Python/PySpark noob here. I have a dataset with an ID variable and multiple rows per ID (the number of rows varies). Another variable, ‘description’, is the character variable I’m interested in. For each ID, I need to check whether description is null on all rows, non-null on all rows, or a mixture of null and non-null. Ideally, I’d separate the data into two datasets: one with the IDs whose descriptions are all null, and one with everything else.

My first thought/hope was that the cases were mutually exclusive, i.e. that an ID with one missing description was missing description on all of its rows. I tested that by counting the unique IDs among rows with a null description and the unique IDs among rows with a non-null description, hoping the two counts would add up to the total number of unique IDs. They don’t, so it seems some IDs have both null and non-null descriptions. How do I tease this out? Thanks in advance!
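One way to approach this in pandas is to compute, per ID, whether all or any of the description values are null, and label each row accordingly. The following is a minimal sketch, not a definitive answer: the sample data is made up, the column names `ID` and `description` are taken from the question, and the `null_status` column and its labels are hypothetical names chosen for illustration.

```python
import pandas as pd

# Made-up sample data; column names follow the question.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3, 3],
    "description": [None, None, "a", "b", None, "c", None],
})

# Row-level null flag, then broadcast per-ID "all null" / "any null" back to each row.
is_null = df["description"].isna()
all_null = is_null.groupby(df["ID"]).transform("all")
any_null = is_null.groupby(df["ID"]).transform("any")

# Label every row with its ID's status (hypothetical column and label names).
df["null_status"] = "mixed"
df.loc[all_null, "null_status"] = "all_null"
df.loc[~any_null, "null_status"] = "none_null"

# Split into "all null per ID" and "everything else".
all_null_df = df[df["null_status"] == "all_null"]
everything_else_df = df[df["null_status"] != "all_null"]

# Unique-ID count per status; these counts sum to the total number of unique IDs.
status_counts = df.groupby("null_status")["ID"].nunique()
print(status_counts)
```

The `mixed` group is what explains the mismatch described above: those IDs are counted once among the null rows and once among the non-null rows, so the two unique counts overshoot the total. In PySpark the same idea could presumably be expressed with `groupBy("ID")` and aggregate counts, since `count` on a column only counts non-null values, but that variant is not shown here.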