Hi everyone!
Recently, I have been using arrow to process data exceeding 500 GB. The reason I use Arrow is that R doesn't seem to be able to import files beyond a certain size into memory.
I keep running into the same problem during left_join.
I only want to join on the ID key column, but Arrow keeps reporting a problem with the other fields. Does anyone know how to deal with this?
In the schema I specified that I only need to import the ID and index_af2 columns, but it still complains about the other columns, which I don't need at the moment.
# Reproducible example
library(arrow)
library(dplyr)

# Read one of the CSVs with base R to show its contents
a1_csv <- read.csv("data/output/AF_1.csv")
a1_csv %>% head()

# Open all AF_* files as a dataset, declaring only the two columns I need
a1_arrow <- list.files("data/output", full.names = TRUE,
                       pattern = "^AF") %>%
  open_csv_dataset(schema = schema(ID = string(),
                                   index_af2 = string())) %>%
  select(ID, index_af2)

# Convert the data frame to an Arrow table so it can be joined to the dataset
a1_csv <- as_arrow_table(a1_csv)
join_af <- a1_csv %>%
  left_join(a1_arrow, by = "ID") %>%
  collect()
> a1_csv %>% head()
ID index_amd2
1 8b1032638ec01b04218e01549cc97164 2006-01-04
2 225feae21b2de084003aad26ef351f41 2006-01-06
3 515882d63972f0c2f4e982dd9c09659e 2006-01-14
4 97e767452e5bfcc1b0a1f7013a546b94 2006-02-04
5 37726c0ee9ec064e56d97b3df3e5808d 2006-02-10
6 cab71a7bd1a5306fe02d682baa6ecfb2 2006-02-15
> a1_csv <- as_arrow_table(a1_csv)
> join_af <- a1_csv %>%
+ left_join(a1_arrow, by="ID") %>%
+ collect()
Error in `compute.arrow_dplyr_query()`:
! Invalid: Could not open CSV input source 'C:/Users/user/Desktop/R Project/Work Case/NHIRD practice/data/output/AF_1.csv':
Invalid: CSV parse error: Row #1: Expected 2 columns, got 4: "ID","index_af2","index_amd2","date"
Run `rlang::last_trace()` to see where the error occurred.
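
From the error message, I'm guessing that when a schema is supplied to open_csv_dataset(), it has to describe every column in the file, and the header row has to be skipped manually because the schema replaces it. Below is a sketch of the two fixes I'm considering (the types I assign to index_amd2 and date are guesses on my part); is either of these the right way to do it?

# Variant A: declare all four columns and skip the header row,
# since a supplied schema replaces the header.
# (index_amd2 and date types are my guesses.)
a1_arrow <- list.files("data/output", full.names = TRUE,
                       pattern = "^AF") %>%
  open_csv_dataset(schema = schema(ID = string(),
                                   index_af2 = string(),
                                   index_amd2 = string(),
                                   date = string()),
                   skip = 1) %>%
  select(ID, index_af2)

# Variant B: let Arrow infer the full schema, pin only the types of
# the columns I care about via col_types, and select them afterwards.
a1_arrow <- list.files("data/output", full.names = TRUE,
                       pattern = "^AF") %>%
  open_csv_dataset(col_types = schema(ID = string(),
                                      index_af2 = string())) %>%
  select(ID, index_af2)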