Hi everyone,
This problem has been troubling me for a long time.
My data is too large to load into memory, so I use Arrow's open_dataset to work with it.
I want to convert CSV files from different years into Parquet format, and I ran into the following problems:
- I want to convert only a selection of fields to Parquet. When open_dataset reads the data, one of the fields is inferred as int (it should actually be string), so write_dataset fails with an error.
- To fix the type problem, I passed a schema to open_dataset that changes all the int fields to string, and then called write_dataset again. I found that as soon as I use schema to change the types, write_dataset tries to convert all the fields of the original CSV, regardless of whether I use select or not.
Question: Is there any way to use select (to choose the columns I want to save) and schema (to change the column types) at the same time? Thank you!
Here is an example:
# Problem 1: selecting columns works, but the type conversion fails on write
library(arrow)
library(dplyr)

p1_gd <- list.files("data/CSV PSM", full.names = TRUE,
                    pattern = "^r0[1-5]_gd") %>%
  file.path() %>%
  open_csv_dataset() %>%
  select(FEE_YM, APPL_TYPE, HOSP_ID, APPL_DATE, CASE_TYPE,
         SEQ_NO, ID, DRUG_DAY, R_HOSP_ID, FUNC_DATE)

write_dataset(
  p1_gd,
  format = "parquet",
  path = "data/parquet2/",
  basename_template = paste0("p1_gd-{i}.parquet"),
  max_rows_per_file = 1e5
)
# error: Invalid: In CSV column #4: Row #19568:
# CSV conversion error to int64: invalid value 'D'
# Problem 2: forcing the column types fixes problem 1, but causes another error
# Open once without a schema, only to get the names of the selected columns
p1_gd <- list.files("data/CSV PSM", full.names = TRUE,
                    pattern = "^r0[1-5]_gd") %>%
  file.path() %>%
  open_csv_dataset() %>%
  select(FEE_YM, APPL_TYPE, HOSP_ID, APPL_DATE, CASE_TYPE,
         SEQ_NO, ID, DRUG_DAY, R_HOSP_ID, FUNC_DATE)

# Build a schema that types every selected column as string
chosen_schema <- schema(
  purrr::map(names(p1_gd), ~Field$create(name = .x, type = string()))
)

# Reopen the files with that schema and select the same columns
p1_gd <- list.files("data/CSV PSM", full.names = TRUE,
                    pattern = "^r0[1-5]_gd") %>%
  file.path() %>%
  open_csv_dataset(schema = chosen_schema) %>%
  select(FEE_YM, APPL_TYPE, HOSP_ID, APPL_DATE, CASE_TYPE,
         SEQ_NO, ID, DRUG_DAY, R_HOSP_ID, FUNC_DATE)

write_dataset(
  p1_gd,
  format = "parquet",
  path = "data/parquet2/",
  basename_template = paste0("p1_gd-{i}.parquet"),
  max_rows_per_file = 1e5
)
# error: Invalid: Could not open CSV input source
# 'C:/Users/user/Desktop/R Project/data/CSV PSM/r01_gd2007.csv':
# Invalid: CSV parse error: Row #1: Expected 10 columns, got 26:
# FEE_YM,APPL_TYPE,HOSP_ID,APPL_DATE,CASE_TYPE,SEQ_NO,R_HOSP_ID,
# R_CASE_TYPE,FUNC_TYPE,FUNC_DATE,DR ...
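
The second error is consistent with the 10-field schema being treated as the complete column list of each file, which does not match the 26 columns actually present. One possible way to combine select and a type override is sketched below. It assumes a recent arrow version in which open_csv_dataset accepts a partial schema through col_types, so that only the named columns are forced to a type and the rest are still inferred from the file. The toy file, the OTHER column, and the temporary directories are hypothetical stand-ins for the real data, not the files from the question.

```r
library(arrow)
library(dplyr)

# Hypothetical toy CSV standing in for one of the real yearly files
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(
  data.frame(
    FEE_YM    = c(200701, 200702),
    APPL_DATE = c("20070105", "D"),  # a non-numeric value like the 'D' in the error
    OTHER     = c(1, 2)              # a column that is not wanted in the output
  ),
  file.path(csv_dir, "r01_gd2007.csv"),
  row.names = FALSE
)

# Partial schema: only APPL_DATE is forced to string,
# all other columns keep their inferred types
ds <- open_csv_dataset(
  csv_dir,
  col_types = schema(APPL_DATE = string())
) %>%
  select(FEE_YM, APPL_DATE)

# Write only the selected columns to Parquet
out_dir <- file.path(tempdir(), "parquet_demo")
write_dataset(ds, path = out_dir, format = "parquet")

# Read the result back to check the columns and types
result <- open_dataset(out_dir) %>% collect()
```

With the real data, the same idea would mean listing each problematic column in col_types and keeping the original select and write_dataset calls unchanged. If col_types is not available in the installed arrow version, an alternative to try is a schema that lists all 26 columns in file order together with skip = 1, so the header row is not read as data.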