here I have two datasets with patient ID, Visit, Visit dates and Results, one should be used as “Main” dataset with correct information while the other one has some inconsistent between Visit and Visit dates. Therefore, I want to merge the second dataset to the “Main” dataset to correct the visit dates. Please note that one visit can have multiple visit dates.
I have two scenarios:
1.One visit has multiple visit dates, the second dataset only have one with correct information, so I want to add a new row with the missing visit dates but with empty result.
2.The visit has only one visit date but the second dataset has the wrong date, so I want to correct the date only.
A very simple example: the main dataset looks like:
Patient ID | Visit | Visit Date | Result1 |
---|---|---|---|
A | visit1 | 2024-9-11 | 1 |
A | visit1 | 2024-9-12 | 2 |
B | visit1 | 2024-9-13 | 3 |
The second dataset looks like:
Patient ID | Visit | Visit Date | Result2 |
---|---|---|---|
A | visit1 | 2024-9-11 | 123 |
B | visit1 | 2024-9-11 | 456 |
And the dataset I want:
Patient ID | Visit | Visit Date | Result1 | Result2 |
---|---|---|---|---|
A | visit1 | 2024-9-11 | 1 | 123 |
A | visit1 | 2024-9-12 | 2 | . |
B | visit1 | 2024-9-13 | 3 | 456 |
Here patient A and B represent the scenarios 1 and 2. Hope someone could help to figure out a simple SAS code to solve this problem. Thanks in advance!
1
From your description, we can trust the date in Dataset 2 if there are multiple dates in Dataset 1, but we cannot trust the date in Dataset 2 if there is only one date in Dataset 1. The only way around this problem I see is to consider two separate lookups using the two scenarios:
- If there are multiple visits, lookup by
patient_id, visit, visit_date
- If there there is only one visit, lookup by
patient_id, visit
Both of these should yield unique lookups for each scenario. If you create a new variable, n_visits
, to count the number of visits for each patient, you can do a dual lookup using hash tables in a DATA Step to account for each scenario.
proc sql;
create table dataset1_counts as
select *, count(*) as n_visits
from dataset1
group by patient_id, visit;
quit;
data want;
set dataset1_counts;
if(_N_ = 1) then do;
/* Scenario 1: Multiple visit dates */
dcl hash h1(dataset:'dataset2');
h1.defineKey('patient_id', 'visit', 'visit_date');
h1.defineData('result2');
h1.defineDone();
/* Scenario 2: One visit date */
dcl hash h2(dataset:'dataset2');
h2.defineKey('patient_id', 'visit');
h2.defineData('result2');
h2.defineDone();
call missing(result2);
end;
/* Use the first lookup for multiple visits. Otherwise, use the second. */
if(n_visits > 1) then rc = h1.Find();
else rc = h2.Find();
drop n_visits rc;
run;
patient_id visit visit_date result1 result2
A visit1 2024-09-11 1 123
A visit1 2024-09-12 2 .
B visit1 2024-09-13 3 456