I have data with 3 columns:
- Name
- Date
- Value
name | date | value |
---|---|---|
john | 2023-10-10 | 10 |
john | 2023-10-15 | 30 |
john | 2023-11-02 | 5 |
john | 2023-11-09 | 2 |
john | 2023-12-01 | 7 |
john | 2023-12-14 | 8 |
lea | 2024-02-04 | 55 |
lea | 2024-02-10 | 73 |
lea | 2024-04-15 | 3 |
lea | 2024-04-18 | 83 |
lea | 2024-04-28 | 14 |
lea | 2024-05-16 | 17 |
lea | 2024-05-20 | 43 |
lea | 2024-05-25 | 7 |
I want to do a by-group comparison between each pair of rows, to get the difference of value
. The expected output for just John would be:
name | date_orig | date_comparison | value_diff | date_diff | records_in_range_no_extremes |
---|---|---|---|---|---|
john | 2023-10-10 | 2023-10-15 | -20 | -5 | 0 |
john | 2023-10-10 | 2023-11-02 | 5 | -23 | 1 |
john | 2023-10-10 | 2023-11-09 | 8 | -30 | 2 |
john | 2023-10-10 | 2023-12-01 | 3 | -52 | 3 |
john | 2023-10-10 | 2023-12-14 | 2 | -65 | 4 |
john | 2023-10-15 | 2023-11-02 | 25 | -18 | 0 |
john | 2023-10-15 | 2023-11-09 | 28 | -25 | 1 |
john | 2023-10-15 | 2023-12-01 | 23 | -47 | 2 |
john | 2023-10-15 | 2023-12-14 | 22 | -60 | 3 |
john | 2023-11-02 | 2023-11-09 | 3 | -7 | 0 |
john | 2023-11-02 | 2023-12-01 | -2 | -29 | 1 |
john | 2023-11-02 | 2023-12-14 | -3 | -42 | 2 |
john | 2023-11-09 | 2023-12-01 | -5 | -22 | 0 |
john | 2023-11-09 | 2023-12-14 | -6 | -35 | 1 |
john | 2023-12-01 | 2023-12-14 | -1 | -13 | 0 |
Reproducible data
library(tibble)
df <-
tibble::tribble(
~name, ~date, ~value,
"john", "2023-10-10", 10L,
"john", "2023-10-15", 30L,
"john", "2023-11-02", 5L,
"john", "2023-11-09", 2L,
"john", "2023-12-01", 7L,
"john", "2023-12-14", 8L,
"lea", "2024-02-04", 55L,
"lea", "2024-02-10", 73L,
"lea", "2024-04-15", 3L,
"lea", "2024-04-18", 83L,
"lea", "2024-04-28", 14L,
"lea", "2024-05-16", 17L,
"lea", "2024-05-20", 43L,
"lea", "2024-05-25", 7L
)
Based on this answer, I could try to use dplyr::left_join()
to join df
to itself by name
. Then, we need to filter out both duplicates and same-range rows. This became cumbersome, as I had to use purrr::map()
and multiple steps that make my solution slow.
I feel like there has to be a simpler way, especially given the many verbs dplyr offers.