I have very large data files (CSV files that are several GB apiece, each roughly a 3059-row by 40-80k-column table) that are used for constructing images from spectroscopy data. They’re standard tab-delimited files, where the first column holds the x-axis values and each subsequent column holds the associated values for one pixel. Like so, for tens of thousands of columns:
900.00 0.000 …
901.00 0.030 …
… … …
3999.00 0.801 …
4000.00 0.798 …
Normally, I can get these files as datapoint tables, and I have written a library with a bunch of functions and operations that use the data (important detail: this whole library is column-based). Recently, I had to explore the idea of manually converting this data from its native format. In that format, the non-x-axis values are stored as a comma-delimited string which contains all the data end to end (so a single string a couple hundred thousand entries long). Such as:
(0.0234,0.0021,…,0.4120,0.3034)
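To make the chunking concrete, here is a minimal sketch of splitting such a string into x-axis-sized pieces. It assumes (this is not confirmed by the native format itself) that each pixel’s values are stored contiguously, one pixel after another; `split_native_values` and `n_points` are hypothetical names used only for illustration.

```python
def split_native_values(raw, n_points):
    """Split the end-to-end value string into one list per pixel.

    Assumption: each pixel's n_points values are contiguous in the string.
    """
    # Drop surrounding whitespace and parentheses, then parse every entry.
    values = [float(v) for v in raw.strip().strip("()").split(",")]
    if len(values) % n_points != 0:
        raise ValueError("string length is not a multiple of n_points")
    # One chunk of n_points values per pixel (row-based layout).
    return [values[i:i + n_points] for i in range(0, len(values), n_points)]


spectra = split_native_values("(0.1,0.2,0.3,0.4,0.5,0.6)", n_points=3)
# Transposing the small in-memory result gives the column-based layout.
columns = list(zip(*spectra))
# columns[0] == (0.1, 0.4)
```

With the real data, `n_points` would be the number of x-axis values (about 3059 here).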
I can currently extract the data and write it to a file, but because you can only write to a file left to right, top to bottom, and can’t write ‘vertically’, I write the data row-based. Since the rest of my library operates on column-based data, I want to ‘rotate’ or ‘tip over’ (transpose) the contents of the file so that the columns become the rows and the file plays nicely with the rest of my code, rather than rewriting my library to be row-based.
Here is what I have written:
import csv
import pandas as pd

def FileRotate(inputfile):
    # targetFiles and fileIndex are defined elsewhere in the code (not shown)
    global outputfile
    outputfile = targetFiles[fileIndex] + "_final.dpt"
    # outputfile = "/path/to/input/file.txt"
    with open(outputfile, 'w', newline='') as output_file:
        outputcsv = csv.writer(output_file, delimiter='\t')
        # Read a single row just to count the columns
        ncols = len(pd.read_csv(inputfile, nrows=1, encoding='utf-8', delimiter='\t').columns)
        print(f'\nNumber of columns which will become number of rows: {ncols}')
        for i in (iterable := list(range(0, ncols))):
            # Re-reads the whole file, keeping only column i
            columntemp = pd.read_csv(inputfile, usecols=[i], encoding='utf-8', delimiter='\t', header=None)
            # Write that column out as a single row
            outputcsv.writerow(columntemp.iloc[:, 0])
            if i < iterable[-1]:
                print(f'Column {i} has successfully been written. Beginning column {i+1}')
            elif i == iterable[-1]:
                print(f'\nColumn {i} has successfully been written. File is complete.')
I’m very new to programming, so I doubt this code is the best, which is why I’m reaching out to you all. Currently, my strategy is to iterate over the string of native data, putting it in x-axis-sized chunks, and then to read one column of the input file at a time (using pandas, since it can read specific columns) and write that column to the new file, so that it becomes a row. The problem is that, at its longest, this rotation could take several hours. Is there an easier or more computationally savvy way of accomplishing this task? I’m also, of course, open to new ways of approaching the problem. I couldn’t care less about the if and the printing; that’s just there so I have some idea of the internals and the pace at which it is running. The declaration of outputfile as global is also for the rest of the code, which is not shown.
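For reference, one possible variation on the strategy above is to pull several columns per pass instead of one, so the input is re-read once per block rather than once per column. The following is only a sketch under that assumption, untested on files of the real size; `rotate_in_blocks` and `block_size` are hypothetical names, and it uses the standard `csv` module rather than pandas so that values are copied through as text.

```python
import csv


def rotate_in_blocks(inputfile, outputfile, block_size=256):
    """Transpose a tab-delimited file, handling block_size columns per pass.

    Hypothetical helper: memory use grows with block_size * number of rows.
    """
    # Count the columns from the first row only.
    with open(inputfile, newline="") as src:
        ncols = len(next(csv.reader(src, delimiter="\t")))

    with open(outputfile, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for start in range(0, ncols, block_size):
            stop = min(start + block_size, ncols)
            # One accumulator list per input column in this block.
            columns = [[] for _ in range(stop - start)]
            with open(inputfile, newline="") as src:
                for row in csv.reader(src, delimiter="\t"):
                    for col, value in zip(columns, row[start:stop]):
                        col.append(value)
            # Each accumulated input column becomes one output row.
            writer.writerows(columns)
```

A call such as `rotate_in_blocks(inputfile, outputfile, block_size=256)` would make roughly `ncols / 256` passes over the input instead of `ncols`; the block size is a trade-off between the number of passes and memory.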
Levi Friss is a new contributor to this site. Take care in asking for clarification, commenting, and answering.