The sample data is as below (The data is fake, not real data):
Key | Death indicator | Date Death | Exact date of death | Death Cause |
---|---|---|---|---|
00 | Alive | |||
02 | Death hos | Y | 25/9/2011 | N00 |
03 | Alive | |||
09 | Death hos | Y | J189 | 28/8/2015 |
07 | Death nonhos | 12/6/2018 | Y | C20 |
From the table, you can see the types of data within the same columns are not consistent. Date Death
should be in date format; Exact date of death
should only contain Y
, N
or blank; Death Cause
should be in string (i.e. ICD code).
I tried to perform some basic data cleaning:
Python:
import pandas as pd
death_y_n = death['Date Death'][pd.to_datetime(death['Date Death'],
format='%d/%m/%Y',
errors = 'coerce')
.isnull()]
death_disease_case = death['Exact date of death'][~((death['Exact date of death'].isin(['Y','N']))
|(death['Exact date of death'].isnull()))]
death['Death Cause'][~pd.to_datetime(
death['Death Cause'],
format='%d/%m/%Y', errors = 'coerce')
.isnull()] =
death_disease_case
death['Date Death'][pd.to_datetime(
death['Date Death'],
format='%d/%m/%Y', errors = 'coerce')
.isnull()] =
death_to_date[pd.to_datetime(
death['Date Death'],
format='%d/%m/%Y', errors = 'coerce')
.isnull()]
death['Exact date of death'][~death['Exact date of death'].isin(['Y','N'])] =
death_y_n[~death['Exact date of death'].isin(['Y','N'])]
death['Death Cause'][pd.to_datetime(
death['Date Death'],
format='%d/%m/%Y', errors = 'coerce')
.isnull()] =
death_y_n[pd.to_datetime(
death['Date Death'],
format='mixed', errors = 'coerce')
.isnull()]
R:
library(tidyverse)
library(magrittr)
library(anytime)
library(Hmisc)
death_to_date = anytime(death$`Death Cause`) %>% as.character
death_y_n = death$`Date Death`[is.na(as_date(death$`Date Death`))]
death_disease_case = death$`Exact date of death`[death$`Exact date of death` %nin% c('Y','N')]
death$`Death Cause`[!is.na(as_date(death$`Death Cause` ))] = death_disease_case[!is.na(as_date(death$`Death Cause` ))]
death$`Date of Registered Death`[is.na(as_date(death$`Date Death`))] = death_to_date[is.na(as_date(death$`Date Death`))]
death$`Exact date of death`[death$`Exact date of death` %nin% c('Y','N')] = death_y_n[death$`Exact date of death` %nin% c('Y','N')]
However, due to multiple formats of date, some date formats cannot be parsed successfully. Is there a method to swap the columns without using to_datetime()
/anytime()
?
I am new to Python, if there are any mistakes I made, please point them out! Thank you.