I’m working on a Python project that processes large NumPy arrays through a chain of transformations. I’ve implemented two classes: `Sample` for individual arrays and `SampleCollection` for managing multiple `Sample` instances. Both classes expose immutable, chainable methods that return a new instance after each transformation.

Here’s a simplified version of my implementation:
```python
import numpy as np


class Sample:
    def __init__(self, array: np.ndarray) -> None:
        self.array = array

    def mean(self):
        # Mean over the last axis (keepdims preserves the axis)
        new_array = self.array.mean(-1, keepdims=True)
        return Sample(new_array)

    def normalize(self):
        # Min-max normalization along the last axis
        min_array = self.array.min(-1, keepdims=True)
        max_array = self.array.max(-1, keepdims=True)
        new_array = (self.array - min_array) / (max_array - min_array)
        return Sample(new_array)

    # Other transformation methods ...


class SampleCollection:
    def __init__(self, samples: list) -> None:
        self.samples = samples

    @property
    def arrays(self):
        # Stack all sample arrays into one (n_samples, ...) array
        return np.stack([sample.array for sample in self.samples])

    def mean(self):
        new_arrays = self.arrays.mean(-1, keepdims=True)
        return SampleCollection([Sample(array) for array in new_arrays])

    def normalize(self):
        min_arrays = self.arrays.min(-1, keepdims=True)
        max_arrays = self.arrays.max(-1, keepdims=True)
        new_arrays = (self.arrays - min_arrays) / (max_arrays - min_arrays)
        return SampleCollection([Sample(array) for array in new_arrays])

    # Other transformation methods ...
```
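For context, a typical pipeline chains these methods and keeps only the final result (a minimal usage sketch based on the methods defined above):

```python
collection = SampleCollection([Sample(np.random.rand(3, 1000)) for _ in range(10)])

# Each call returns a brand-new SampleCollection (and new Sample objects inside it)
result = collection.normalize().mean()
print(result.arrays.shape)  # (10, 3, 1)
```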
I’m dealing with large datasets, for example:
```python
rand_arr = np.random.rand(3, 1000)
number_of_samples = 150_000
collection = SampleCollection([Sample(rand_arr) for _ in range(number_of_samples)])
```
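For scale, a single stacked copy produced by the `arrays` property is already a few gigabytes at these shapes (a back-of-the-envelope check, assuming float64):

```python
per_sample_bytes = rand_arr.nbytes                     # 3 * 1000 * 8 = 24_000 bytes
stacked_bytes = per_sample_bytes * number_of_samples   # one (150_000, 3, 1000) stack
print(f"{stacked_bytes / 1024**3:.2f} GiB")            # ~3.35 GiB per materialized stack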
I have two main concerns about the current implementation:
- Memory usage: Creating a new copy of every object for each transformation leads to significant memory consumption, especially with arrays this large (see the size estimate above).
- Performance: Instantiating new objects in a list comprehension on every transformation may be slow (a rough timing sketch follows this list).
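To put a number on the second point, I measure one transformation with `timeit` (a timing sketch only; the small sample count keeps it quick, and absolute times will vary by machine):

```python
import timeit

small = SampleCollection([Sample(np.random.rand(3, 1000)) for _ in range(1_000)])

# Time a single normalize() call: stacking, the vectorized math, and re-wrapping
# every result row in a new Sample all count toward this figure.
elapsed = timeit.timeit(small.normalize, number=10) / 10
print(f"normalize(): {elapsed * 1e3:.1f} ms per call at 1_000 samples")
```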
What are the best practices or design patterns to optimize memory usage and improve performance for this kind of immutable, chainable class design?