Hello fellow developers,
I am currently working on a real-time data processing system in Python and Pandas, where I have to deal with DataFrames that are too large to hold in memory all at once. My goal is to keep memory usage under control while still performing operations and transformations on the data efficiently.
Specifically, my data source contains millions of records with a mix of numerical and categorical fields. At the moment I read the file in chunks using Pandas read_csv with the chunksize parameter, but I still run into memory and speed problems.
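Here is a simplified sketch of what I am currently doing; the file name, chunk size, and the category/value column names are only placeholders for my real data:

```python
import pandas as pd

CHUNK_SIZE = 100_000  # rows per chunk, currently chosen by trial and error

# "events.csv", "category" and "value" stand in for my real file and columns
partials = []
for chunk in pd.read_csv("events.csv", chunksize=CHUNK_SIZE):
    # typical per-chunk work: filter rows, derive a column, aggregate per category
    chunk = chunk[chunk["value"] > 0].copy()
    chunk["value_scaled"] = chunk["value"] * 1.5
    partials.append(
        chunk.groupby("category")["value_scaled"].agg(["sum", "count"])
    )

# combine the per-chunk aggregates into one final summary
summary = pd.concat(partials).groupby(level=0).sum()
summary["mean"] = summary["sum"] / summary["count"]
print(summary)
```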
Can anyone recommend more advanced approaches or libraries for working with data of this size in Python? I am especially interested in techniques for efficiently ingesting, processing, and summarizing large data streams.
Any help or example code for building the pipeline described above would be highly appreciated. Thank you!
First, I tried to build a real-time processing pipeline using Pandas chunking, reading the file with read_csv and a chunksize, which is a common approach when a dataset does not fit into memory. My expectation was that each processing step on a chunk would run as fast as possible, use as little memory as possible, and stay as general as possible. However, I still run into memory and computation problems, especially when the data streams are very large, which suggests I need more optimized techniques or packages that can handle data of this size efficiently in Python.