Select and convert columns to write_dataset in Arrow
Hi everyone,
I want to select certain fields from CSV files covering different years, convert the field formats, and write the result out as Parquet with write_dataset.
(I am using arrow because data larger than about 30 GB is difficult to import into R directly.)
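Here is a rough sketch of the kind of pipeline I have in mind; the directory names and column names below are only placeholders for illustration:

```r
library(arrow)
library(dplyr)

# Open all yearly CSV files as one lazy dataset; nothing is read into memory yet.
ds <- open_dataset("csv_by_year/", format = "csv")

ds |>
  select(year, id, amount) |>        # keep only the fields that are needed
  mutate(
    id     = as.character(id),       # harmonise field types across years
    amount = as.numeric(amount)
  ) |>
  group_by(year) |>                  # one Parquet partition per year
  write_dataset("parquet_out/", format = "parquet")
```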
Strategy for joining two very large arrow datasets without blowing up memory usage
I have two very large datasets in Parquet files that I'm reading with R's arrow::open_dataset(). One file has over 20 million rows and the other about 15 million.
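A minimal sketch of the join I am attempting, keeping everything inside the Arrow engine; the file paths and the `id` key column are placeholders:

```r
library(arrow)
library(dplyr)

big   <- open_dataset("big_20m.parquet")    # ~20 million rows
small <- open_dataset("small_15m.parquet")  # ~15 million rows

# The join runs in the Arrow engine; nothing is pulled into R memory
# until collect(), so the result is written straight back to disk instead.
big |>
  left_join(small, by = "id") |>
  write_dataset("joined_out/", format = "parquet")
```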
Using the left_join function in Arrow
Hi everyone!
Recently I have been using arrow to process data exceeding 500 GB. I use Arrow because R seems unable to import data beyond a certain size into memory.
I have found that problems often come up when I use left_join.
I want to join on the ID column as the key, but it always reports a problem with other fields. Does anyone know how to deal with this?
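One thing I am trying, in case the trouble is that the ID column (or another shared column) has a different type in the two datasets: cast the key to the same type on both sides and keep only the columns I actually need from the right-hand table. The paths and column names here are placeholders.

```r
library(arrow)
library(dplyr)

left  <- open_dataset("left_data/")
right <- open_dataset("right_data/")

# Align the key's type on both sides, and drop unneeded right-hand columns
# so duplicated field names don't interfere with the join.
left  <- left  |> mutate(ID = as.character(ID))
right <- right |>
  select(ID, extra_field) |>          # key plus the fields to be added
  mutate(ID = as.character(ID))

left |>
  left_join(right, by = "ID") |>
  write_dataset("joined_out/", format = "parquet")
```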