What are efficient ways to process huge lists (10+ million items), and what should be considered while manipulating them?
First question: when should I use recursion, and when shouldn't I? In both cases we allocate memory to store data (the memory complexity should be comparable), but a recursive approach over a huge list is costly in a different way: every recursive call adds a frame to the call stack, while the list itself lives on the heap. Which approach costs less? (Usually any recursive algorithm can be rewritten as an iterative one.)
Second question: when it comes to memory complexity, should I consider transforming the list into another data structure (a linked list, a BST, ... to manipulate the data more easily)? The problem with such solutions is that the memory allocation roughly doubles: every node holds a reference plus a value, instead of just a value in a list.
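A rough way to sanity-check that "doubled memory" intuition is to measure it with sys.getsizeof (the exact figures vary by interpreter and platform, and getsizeof does not follow references):

import sys

print(sys.getsizeof(range(1000)))  # the list's pointer array only; the
                                   # int objects it references are extra

class Node(object):
    __slots__ = ('value', 'next')  # without __slots__, every node also
                                   # carries a per-instance __dict__
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

print(sys.getsizeof(Node(0)))      # per-node cost, before the value itself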
Or should I dump the list to disk (text or CSV) and process the data in portions? The problem with this solution is that some operations require correlation between values, so we can't process every portion independently and then reduce and concatenate the results.
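A minimal sketch of what I mean by processing in portions, assuming a file with one integer per line (the format is just an assumption):

def read_in_portions(path, portion_size=1000000):
    # Yield the file's numbers in fixed-size portions, so the whole
    # list is never resident in memory at once.
    portion = []
    with open(path) as f:
        for line in f:
            portion.append(int(line))
            if len(portion) == portion_size:
                yield portion
                portion = []
    if portion:  # trailing partial portion
        yield portion

As said above, this only works cleanly when portions can be processed independently; operations that correlate values across portion boundaries need an extra merge step.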
Scenario 1: Sorting a huge list (recursive Merge Sort vs. others)
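For comparison, one non-recursive direction here would be to sort manageable runs and merge them lazily with the standard library's heapq.merge; a toy sketch (with real data the runs could live on disk):

import heapq

# Sort small runs independently, then merge the sorted runs lazily.
runs = [sorted([5, 1, 9]), sorted([4, 8, 2]), sorted([7, 3, 6])]
merged = list(heapq.merge(*runs))
assert merged == [1, 2, 3, 4, 5, 6, 7, 8, 9]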
Scenario 2: consider this quick example (it's not the perfect or most relevant illustration of what I'm trying to ask, but it shows another case).
We add the product of each group of 4 successive elements to every element of that group.
L = range(100)

from operator import mul

def prodElement(L):
    # Iterative version: walk the list in blocks of 4 and add each block's
    # product to every element of that block, in place.
    for i in xrange(0, len(L) - 3, 4):
        p = reduce(mul, L[i:i+4])
        L[i], L[i+1], L[i+2], L[i+3] = L[i]+p, L[i+1]+p, L[i+2]+p, L[i+3]+p
    return L

def prodElement_rec(L, i=0):
    # Recursive version: one call per block of 4.
    if i >= len(L) - 3:  # stop: fewer than 4 elements remain
        return L
    p = L[i] * L[i+1] * L[i+2] * L[i+3]
    L[i], L[i+1], L[i+2], L[i+3] = L[i]+p, L[i+1]+p, L[i+2]+p, L[i+3]+p
    return prodElement_rec(L, i + 4)

# Both functions mutate their argument, so compare on fresh copies.
assert prodElement_rec(list(L)) == prodElement(list(L))
Again, I am not trying to solve one specific issue; I am trying to understand the best approaches to use when it comes to huge lists (10+ million items). What methods should I consider, and what critical issues should I avoid, when solving problems over huge lists?
In its current form your question is likely to be closed as “too broad”, but since you asked in this broad form, I'll try to give you a broad answer:
- recursion: don't use recursion when the expected recursion depth will be on the order of magnitude of your list's size, especially when your programming language does not optimize tail calls automatically (Python does not). And assuming your recursion depth stays reasonable: use it only when it makes the code simpler and easier to understand. In your example above, the recursion depth will be about list size / 4 (which will blow past Python's recursion limit when your list is as large as you wrote; see the check after this list), and the code is definitely not simpler than the non-recursive variant.
- writing data to disk: do this only when you expect to exceed the available main memory of your system (meaning the memory available to your program). This implies the need for a so-called external algorithm for the given task, which is almost always more complicated than an in-memory algorithm, so do this only if you must. EDIT, due to the comment above: using a database is a good alternative for a lot of scenarios (sketched after this list). It introduces some additional programming overhead on one hand, but can save a lot on the other.
- which data structure to use: start with the simplest data structure that is suitable for the given task, and then measure your performance (maybe with smaller lists first, but make sure you know the approximate order of growth of your algorithm's runtime before you extrapolate to bigger lists; a timing sketch follows after this list). Only when your code does not run fast enough should you try more complex data structures and algorithms (and don't forget to measure again).
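To make the recursion point concrete: CPython caps the recursion depth rather than growing the stack indefinitely, so the recursive variant from the question fails long before 10 million items. A quick check (the exact limit varies by installation):

import sys

print(sys.getrecursionlimit())  # typically 1000 by default
# prodElement_rec recurses once per 4 elements, so a 10-million-item list
# needs ~2.5 million frames; Python aborts with "maximum recursion depth
# exceeded" long before that.

Note that merge sort (Scenario 1) is different: its recursion depth is only about log2(n), roughly 24 levels for 10 million items, so depth is not the problem there.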
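On the database alternative, here is a minimal sketch using the standard library's sqlite3 module; the file name and table layout are just placeholders, not a recommendation for your actual schema:

import sqlite3

conn = sqlite3.connect('huge_list.db')  # placeholder file name
conn.execute('CREATE TABLE IF NOT EXISTS items (value INTEGER)')
conn.executemany('INSERT INTO items (value) VALUES (?)',
                 ((v,) for v in xrange(100)))  # stand-in for the real data
conn.commit()
# the sort now happens on disk, inside the engine:
for (v,) in conn.execute('SELECT value FROM items ORDER BY value LIMIT 5'):
    print(v)
conn.close()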
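And for the measuring step, the standard library's timeit is usually enough. A sketch that times the iterative version on growing input sizes, assuming prodElement from Scenario 2 is in scope:

import timeit

# If the timings grow roughly linearly with n, the algorithm looks O(n)
# and extrapolating to 10M+ items is reasonably safe.
for n in (10**4, 10**5, 10**6):
    data = range(n)
    t = timeit.timeit(lambda: prodElement(list(data)), number=3)
    print('n=%d: %.3fs' % (n, t))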
To sum it up: it depends on what you are going to do with those lists, and on your memory and time constraints.