My work requires constructing large square matrices (of order 1000) as numpy arrays, whose elements are defined analytically as a function of their indices. Right now I initialize a zero array and loop over its elements to fill in the values, which by itself takes a long time to evaluate. Is there any way to make the construction more efficient or faster, for example by using the GPU, parallel computing, or something similar? A minimal sketch of my current approach is shown below.
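For reference, this is roughly what I'm doing now. The element function `f` here is just a placeholder I made up for illustration; my actual function is more involved, but it is still an analytic expression in the two indices:

```python
import numpy as np

N = 1000

# Placeholder element function for illustration only;
# the real one is some analytic expression of (i, j)
def f(i, j):
    return np.sin(i * j) / (1.0 + i + j)

# Current approach: allocate a zero matrix, then fill it element by element
A = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        A[i, j] = f(i, j)
```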
I tried using cupy instead of numpy, but due to (independent) issues with CUDA not recognizing my dedicated GPU on my Arch Linux installation, it actually ended up slower than numpy.