Is there a function in the arrow package to get a vector of hive-style partition names from an opened hive-style dataset?
I am using MPI (via pbdMPI) with the arrow and dplyr packages on a cluster for multinode parallel reads of a very large Parquet data set. The hive-structured Parquet directories are convenient for parallel arrow–dplyr reads of directory batches with MPI. To make it work, I need a character vector of the partition directory values to pass to dplyr::filter() so that each rank can do an independent arrow–dplyr read of its own set of subdirectories. For example, with directories:
'/dir1/dir2/flightDate=2022-04-17/dir3/part-0.parquet'
'/dir1/dir2/flightDate=2022-05-18/dir3/part-0.parquet'
'/dir1/dir2/flightDate=2022-06-18/dir3/part-0.parquet'
'/dir1/dir2/flightDate=2022-07-19/dir3/part-0.parquet'
'/dir1/dir2/flightDate=2022-08-19/dir3/part-0.parquet'
'/dir1/dir2/flightDate=2022-09-19/dir3/part-0.parquet'
'/dir1/dir2/flightDate=2022-10-20/dir3/part-0.parquet'
I want the vector dates = c('2022-04-17', '2022-05-18', '2022-06-18', '2022-07-19', '2022-08-19', '2022-09-19', '2022-10-20'), which I then use in:
library(pbdMPI)
library(arrow)
library(dplyr)
my_dates = dates[comm.chunk(length(dates), form = "vector")]
my_data = ds %>% filter(flightDate %in% my_dates) %>% collect()
to read different data chunks in parallel by each MPI rank.
I can get it with list.dirs() and some string manipulation, but is there an arrow-native function that would get it? Ideally, after ds = arrow::open_dataset(<filename>), the function would operate on the ds structure that has the directory information.
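For reference, here is a minimal sketch of the kind of string manipulation I mean, in base R only. The helper name partition_values is made up for illustration, and it assumes the partition key contains no regex metacharacters. It takes any character vector of paths, so it should work on list.dirs() output or on a vector of file paths from another source.

```r
# Hypothetical helper (not part of arrow): pull the values of one
# hive-style partition key out of a character vector of paths.
partition_values <- function(paths, key) {
  # Match "<key>=<value>" as a whole path component.
  pattern <- paste0("(^|/)", key, "=([^/]+)")
  # regmatches() keeps only the matched text of paths that contain the key.
  hits <- regmatches(paths, regexpr(pattern, paths))
  # Strip the optional leading "/" and the "<key>=" prefix, then deduplicate.
  sort(unique(sub(paste0("^/?", key, "="), "", hits)))
}

# Example paths copied from the listing above (one repeated on purpose).
paths <- c(
  "/dir1/dir2/flightDate=2022-06-18/dir3/part-0.parquet",
  "/dir1/dir2/flightDate=2022-04-17/dir3/part-0.parquet",
  "/dir1/dir2/flightDate=2022-05-18/dir3/part-0.parquet",
  "/dir1/dir2/flightDate=2022-04-17/dir3/part-0.parquet"
)
dates <- partition_values(paths, "flightDate")

# On disk this would be something like (the root path is an assumption):
# dates <- partition_values(list.dirs("/dir1/dir2", recursive = TRUE), "flightDate")
```

This works, but it re-derives from the file system what the opened dataset presumably already knows, which is why I am hoping for something arrow-native.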