Let's say, for example, I have 20 GB of data that I read (in a single query) from some database and store in fetched_data. This fetched_data is preprocessed and then upserted into another database. (I need ALL of the data; I can't use MapReduce or anything similar.)
fetched_data = <read data from database>
Now, the simple option is to store all of this in RAM, then read it and preprocess all the data sequentially. I know this is expensive and risks exhausting memory, etc., but this option is feasible for me.
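For context, here is roughly what I mean by the sequential version; the connections, query, and preprocess() function are just placeholders, not my real code:

```python
# Rough sketch of the all-in-RAM version. source_conn, dest_conn,
# the queries and preprocess() are placeholders.
def run_in_memory(source_conn, dest_conn, preprocess):
    cur = source_conn.cursor()
    cur.execute("SELECT * FROM source_table")    # placeholder query
    fetched_data = cur.fetchall()                # the whole 20 GB ends up in RAM here

    processed = [preprocess(row) for row in fetched_data]

    dest = dest_conn.cursor()
    dest.executemany(
        "INSERT INTO target_table (id, value) VALUES (%s, %s) "
        "ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value",  # PostgreSQL-style upsert, placeholder
        processed,
    )
    dest_conn.commit()
```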
My question is: would it be "smart" to do this another way?
I would read the data from the database in chunks (SO: Read query in chunks); after the query returns each chunk, I would save it to disk. That would be done for all the chunks. Once all the chunks are stored on disk, I would use multithreading to read and preprocess the data.
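Something like this is what I have in mind for the chunked fetch and spill to disk (the query, chunk size, and pickle format are all placeholders):

```python
import pickle
from pathlib import Path

CHUNK_ROWS = 100_000           # placeholder, would be tuned to the row size
SPILL_DIR = Path("spill")      # placeholder scratch directory

def spill_to_disk(source_conn):
    """Fetch the query in chunks and write each chunk to its own file on disk."""
    SPILL_DIR.mkdir(exist_ok=True)
    cur = source_conn.cursor()
    cur.execute("SELECT * FROM source_table")    # placeholder query
    chunk_paths = []
    i = 0
    while True:
        rows = cur.fetchmany(CHUNK_ROWS)         # DB-API chunked fetch
        if not rows:
            break
        path = SPILL_DIR / f"chunk_{i:05d}.pkl"
        with path.open("wb") as f:
            pickle.dump(rows, f)
        chunk_paths.append(path)
        i += 1
    return chunk_paths
```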
(I want to do some parsing; I currently don't know whether to use multithreading or multiprocessing, because one part is reading from disk (I/O-bound) and the other part is parsing the data (CPU-bound).)
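To make that question concrete, here is a process-based version of that step; whether this actually beats plain threading is exactly what I'm unsure about. parse_row() and the chunk-file layout are placeholders:

```python
from concurrent.futures import ProcessPoolExecutor
import pickle
from pathlib import Path

def parse_row(row):
    # placeholder for the real parsing, which is the CPU-heavy part
    return row

def parse_chunk(path):
    """Worker: load one chunk from disk, parse every row, write the result back out."""
    with Path(path).open("rb") as f:
        rows = pickle.load(f)
    parsed = [parse_row(r) for r in rows]
    out_path = Path(path).with_suffix(".parsed.pkl")
    with out_path.open("wb") as f:
        pickle.dump(parsed, f)
    return out_path

def parse_all_chunks(chunk_paths):
    # Each worker process does its own disk read + parse, so reads and parsing overlap.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(parse_chunk, chunk_paths))
```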
After all the data is preprocessed, I would use multithreading to send this data (stored, preprocessed, on disk) to the other database.
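For that final step, a rough sketch of the multithreaded upsert; the SQL and the make_conn() connection factory are placeholders, and I'd give each thread its own connection since most drivers' connections aren't thread-safe:

```python
from concurrent.futures import ThreadPoolExecutor
import pickle

def upsert_chunk(path, make_conn):
    """Worker: load one preprocessed chunk and upsert it into the target database."""
    with open(path, "rb") as f:
        rows = pickle.load(f)
    conn = make_conn()              # one connection per thread
    try:
        cur = conn.cursor()
        cur.executemany(
            "INSERT INTO target_table (id, value) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value",  # PostgreSQL-style upsert, placeholder
            rows,
        )
        conn.commit()
    finally:
        conn.close()

def upsert_all(parsed_paths, make_conn, workers=8):
    # Threads should be fine here since the time is spent waiting on the network/DB.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda p: upsert_chunk(p, make_conn), parsed_paths))
```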
Would this be a good practice?