The sample data is as below (The data is fake, not real data):
Key | Death indicator | Date Death | Exact date of death | Death Cause |
---|---|---|---|---|
00 | Alive | |||
02 | Death hos | Y | 25/9/2011 | N00 |
03 | Alive | |||
09 | Death hos | Y | J189 | 28/8/2015 |
07 | Death nonhos | 12/6/2018 | Y | C20 |
From the table, you can see the types of data within the same columns are not consistent. Date Death
should be in date format; Exact date of death
should only contain Y
, N
or blank; Death Cause
should be in string (i.e. ICD code).
I forgot to mention one important thing, the date format may not be consistent, e.g. ’01-05-2010′,’01 May 2010′ can also appear in the date columns.
I tried to perform some basic data cleaning:
Python:
import pandas as pd
death_y_n = death['Date Death'][pd.to_datetime(death['Date Death'],
format='%d/%m/%Y',
errors = 'coerce')
.isnull()]
death_disease_case = death['Exact date of death'][~((death['Exact date of death'].isin(['Y','N']))
|(death['Exact date of death'].isnull()))]
death['Death Cause'][~pd.to_datetime(
death['Death Cause'],
format='%d/%m/%Y', errors = 'coerce')
.isnull()] =
death_disease_case
death['Date Death'][pd.to_datetime(
death['Date Death'],
format='%d/%m/%Y', errors = 'coerce')
.isnull()] =
death_to_date[pd.to_datetime(
death['Date Death'],
format='%d/%m/%Y', errors = 'coerce')
.isnull()]
death['Exact date of death'][~death['Exact date of death'].isin(['Y','N'])] =
death_y_n[~death['Exact date of death'].isin(['Y','N'])]
death['Death Cause'][pd.to_datetime(
death['Date Death'],
format='%d/%m/%Y', errors = 'coerce')
.isnull()] =
death_y_n[pd.to_datetime(
death['Date Death'],
format='mixed', errors = 'coerce')
.isnull()]
R:
library(tidyverse)
library(magrittr)
library(anytime)
library(Hmisc)
death_to_date = anytime(death$`Death Cause`) %>% as.character
death_y_n = death$`Date Death`[is.na(as_date(death$`Date Death`))]
death_disease_case = death$`Exact date of death`[death$`Exact date of death` %nin% c('Y','N')]
death$`Death Cause`[!is.na(as_date(death$`Death Cause` ))] = death_disease_case[!is.na(as_date(death$`Death Cause` ))]
death$`Date of Registered Death`[is.na(as_date(death$`Date Death`))] = death_to_date[is.na(as_date(death$`Date Death`))]
death$`Exact date of death`[death$`Exact date of death` %nin% c('Y','N')] = death_y_n[death$`Exact date of death` %nin% c('Y','N')]
However, due to multiple formats of date, some date formats cannot be parsed successfully. Is there a method to swap the columns without using to_datetime()
/anytime()
?
I am new to Python, if there are any mistakes I made, please point them out! Thank you.
Updated:
My python solution:
import pandas as pd
#for death date variable save as exact date of death:'Y'/'N'
death_to_date_index_exact = (death['Date Death'].isin(['Y','N']))
death_to_date_exact = death['Date Death'][death_to_date_index_exact]
#for death cause variable save as date of death
death_cause_index_date = (~death['Death Cause'].str.contains('^[A-Za-z].*[0-9]$',na=True))
death_cause_date = death['Death Cause'][death_cause_index_date]
#for exact date of death variable save as death cause
death_exact_index_cause = (death['Exact date of death'].str.contains('^[A-Za-z].*[0-9]$',na=False))
death_exact_cause = death['Exact date of death'][death_exact_index_cause]
death['Date Death'][death_cause_index_date] = death_cause_date
death['Exact date of death'][death_to_date_index_exact] = death_to_date_exact
death['Death Cause'][death_exact_index_cause] = death_exact_cause
#Convert the date in death cause into empty
death['Death Cause'][~death['Death Cause'].str.contains('^[A-Za-z].*[0-9]$',na=True)] = np.nan
1
There’s a few different ways you can do this. The simplest way to me, which doesn’t require creating a new vector three times, is to create a named list in R, then unnest it wider, as follows:
library(tidyverse)
df |>
mutate(sorted = pmap(list(Date.Death, Exact.date.of.death, Death.Cause), (...) {out <- list()
for (x in c(...)) { # for each row of the three columns above
if (x %in% c("Y", "N", NA)) { # we do this one first, otherwise the NAs throw things off
out$Exact.date.of.death <- x
} else if (grepl("/", x)) {
out$Date.Death <- x
} else {
out$Death.Cause <- x
}
}
out
}), .keep = "unused") |> # drop the columns we used (i.e. Date.Death, Exact.date.of.death, Death.Cause)
unnest_wider(sorted) # then unnest the named list column, making new versions of them!
Output:
# A tibble: 5 × 5
Key Death.indicator Exact.date.of.death Date.Death Death.Cause
<chr> <chr> <chr> <chr> <chr>
1 Alive NA NA NA NA
2 Death hos Y 25/9/2011 N00
3 Alive NA NA NA NA
4 Death hos Y 28/8/2015 J189
5 Death nonhos Y 12/6/2018 C20
Data:
df <- read.table(text="Key Death.indicator Date.Death Exact.date.of.death Death.Cause
00 Alive NA NA NA NA
02 Death hos Y 25/9/2011 N00
03 Alive NA NA NA NA
09 Death hos Y J189 28/8/2015
07 Death nonhos 12/6/2018 Y C20", header=TRUE)
Notes:
- You’ve said you’ve had trouble parsing your dates. It would be helpful if you could give some examples of the dates you are struggling to parse.
- Your data still needs some cleaning – for instance,
"Y"
and"N"
should beTRUE
andFALSE
. Also if the data is death data, then it would make sense to remove the Alive ones, they add nothing. Also, blanks shouldNA
.
3
Dates are tricky, but you might want to check out the lubridate package. What errors does the R code throw? If you’re getting stopped by errors I would suggest starting the dates out as a character column, getting all the data in the same column, then parse the NA values. Once you have everything in the correct format, then coerce the column to date format. Since the character column will accept your changes, you’ll be able to do whatever cleaning you need to do. The problem with coercing to date format on the first step is that often these generic conversion functions like as.date()
are very picky. If any one value is in the wrong format, the whole thing will fail and throw an error. So if you’re getting caught up in that, try doing that conversion with anytime
in the last step and see if you don’t have better results.
1