I have a data frame (df) with 1×10^7 observations and 3 variables that looks like this:
sp_id sample reads
<char> <char> <int>
1: sp1 sample1 255
2: sp1 sample2 1
3: sp1 sample3 1152
4: sp2 sample1 114
5: sp2 sample2 3
---- ---- ---
10000000: sp42500 sample6700 4554
In total I have 6700 different samples and 42500 species.
I would like to spread the data so that each sample becomes a column and the reads are the values. Species/sample combinations that are missing should be filled with 0.
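To make the goal concrete, here is a minimal sketch of the reshaping on a toy table. The object names `toy` and `wide` are made up for illustration, the values are just the first five rows shown above, and I am assuming the column names from the printed table (`sp_id`, `sample`, `reads`). It uses `tidyr::pivot_wider()`, and it only shows the intended shape on tiny data, not a fix for the full-size case.

```r
library(tidyr)

# Toy data (hypothetical): same three columns as the real table,
# filled with the first five rows printed above
toy <- data.frame(
  sp_id  = c("sp1", "sp1", "sp1", "sp2", "sp2"),
  sample = c("sample1", "sample2", "sample3", "sample1", "sample2"),
  reads  = c(255L, 1L, 1152L, 114L, 3L)
)

# One row per species, one column per sample;
# combinations that do not occur (here sp2/sample3) become 0
wide <- pivot_wider(
  toy,
  names_from  = sample,
  values_from = reads,
  values_fill = 0L
)

print(wide)
```

On the toy data this gives one row each for sp1 and sp2, with sample3 set to 0 for sp2, which is the layout I expect for the full table.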
I used:
data <- df %>% spread(key=sample, value=nreads, fill=0)
My expected result would be:
sp_id sample1 sample2 sample3 ... sample6700
<char> <int> <int> <int> <int>
1: sp1 255 1 1152 561
2: sp2 114 3 0 3
---- ---- --- ---- ----
42500: sp42500 715 0 0 4554
The problem is that after the 24th column everything is filled with 0, as if those samples had no species observations (which is not true):
sp_id sample24 sample25 sample26 ... sample6700
<char> <int> <int> <int> <int>
1: sp1 45 0 0 0
2: sp2 3 0 0 0
---- ---- --- ---- ----
42500: sp42500 715 0 0 0
I have also tried other functions, which fail with errors:
> data <- df %>% pivot_wider(names_from = sample, values_from = nreads, values_fill=0)
Error in `vec_rep_each()`:
! `times` can't be missing.
Location 1 is missing.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In nrow * ncol : NAs produced by integer overflow
> data <- dcast(setDT(df), asv_id ~ sample, value.var = "nreads", fill=0)
Error: Cross product of elements provided to CJ() would result in 2853905958 rows which exceeds .Machine$integer.max == 2147483647
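Both errors point at the size of the wide table (rows × columns) exceeding `.Machine$integer.max`. With 42500 species and 6700 samples I would expect 42500 × 6700 = 284,750,000 cells, which is below that limit, so the 2853905958 reported by `CJ()` may mean there are more distinct IDs or sample names than I counted. Below is a rough sketch of how the sizes could be checked. `wide_size()` is a hypothetical helper written for this post, not part of data.table, and the column names are assumed from the table printed at the top.

```r
library(data.table)

# Hypothetical helper: how many cells would the wide table need?
wide_size <- function(dt, row_col, col_col) {
  n_rows <- uniqueN(dt[[row_col]])   # distinct row keys (species)
  n_cols <- uniqueN(dt[[col_col]])   # distinct column keys (samples)
  # Convert to double before multiplying so the product cannot
  # overflow R's 32-bit integers
  cells <- as.numeric(n_rows) * as.numeric(n_cols)
  list(
    n_rows       = n_rows,
    n_cols       = n_cols,
    cells        = cells,
    fits_integer = cells <= .Machine$integer.max
  )
}

# Toy data (hypothetical) just to show the call
toy <- data.table(
  sp_id  = c("sp1", "sp1", "sp1", "sp2", "sp2"),
  sample = c("sample1", "sample2", "sample3", "sample1", "sample2"),
  reads  = c(255L, 1L, 1152L, 114L, 3L)
)

size <- wide_size(toy, "sp_id", "sample")
print(size)
```

On the real data the call would be `wide_size(df, "sp_id", "sample")` (or with whatever the ID column is actually called). If `n_rows` or `n_cols` comes out larger than 42500 or 6700, the keys probably contain unexpected distinct values.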
Thank you everyone!