I have a data file like the following (simplified, I have more columns):
| | timestamp | frame_idx | gaze_pos_x | gaze_pos_y | gaze_dir_x | gaze_dir_y | gaze_dir_z |
---|---|---|---|---|---|---|---|
0 | 2269.17 | 45 | 893.314 | 500.136 | 0.165454 | -0.0222454 | 0.985967 |
1 | 2274.17 | 45 | 896.61 | 502.564 | 0.176397 | -0.0098666 | 0.98427 |
2 | 2279.17 | 46 | 900.592 | 499.049 | 0.189087 | -0.018215 | 0.981791 |
3 | 2284.17 | 46 | 906.321 | 478.184 | 0.18891 | -0.0307506 | 0.981513 |
4 | 2289.17 | 46 | 893.465 | 502.793 | 0.175493 | -0.0210113 | 0.984257 |
5 | 2294.17 | 46 | 898.629 | 497.182 | 0.190142 | -0.0151722 | 0.981639 |
6 | 2299.3 | 46 | 893.554 | 496.782 | 0.183007 | -0.0150504 | 0.982996 |
7 | 2304.3 | 46 | 905.338 | 482.343 | 0.188236 | -0.0249608 | 0.981807 |
8 | 2309.3 | 46 | 897.44 | 495.476 | 0.187434 | -0.0199951 | 0.982074 |
9 | 2424.3 | 48 | 893.358 | 495.474 | 0.171512 | -0.0198278 | 0.984982 |
And an object like this (again simplified):
class Gaze:
    def __init__(self, ts, frame_idx, gaze2D, gaze_dir3D=None):
        self.ts = ts
        self.frame_idx = frame_idx
        self.gaze2D = gaze2D
        self.gaze_dir3D = gaze_dir3D
where `gaze2D` is a numpy array containing `[gaze_pos_x, gaze_pos_y]` and `gaze_dir3D` is a numpy array containing `[gaze_dir_x, gaze_dir_y, gaze_dir_z]`.
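To make the mapping concrete, here is a minimal sketch of the object expected for the first row of the table above. The values are copied from that row; the variable name `first` is only for illustration.

```python
import numpy as np

class Gaze:
    def __init__(self, ts, frame_idx, gaze2D, gaze_dir3D=None):
        self.ts = ts
        self.frame_idx = frame_idx
        self.gaze2D = gaze2D
        self.gaze_dir3D = gaze_dir3D

# Row 0 of the table: timestamp, frame_idx, then the two column groups
first = Gaze(
    2269.17,
    45,
    np.array([893.314, 500.136]),                 # [gaze_pos_x, gaze_pos_y]
    np.array([0.165454, -0.0222454, 0.985967]),   # [gaze_dir_x, gaze_dir_y, gaze_dir_z]
)
```

If the `gaze_dir_*` columns are absent from the file, `gaze_dir3D` would simply stay `None`.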
I want to load the data file efficiently and create one `Gaze` object per row. I have implemented the function below, but it is very slow:
import pandas as pd
from collections import defaultdict

def readDataFromFile(fileName):
    gazes = []
    data = pd.read_csv(str(fileName), delimiter='\t', index_col=False, dtype=defaultdict(lambda: float, frame_idx=int))
    allCols = tuple([c for c in data.columns if col in c] for col in (
        'gaze_pos','gaze_dir'))
    # allCols -> ([gaze_pos_x, gaze_pos_y],[gaze_dir_x, gaze_dir_y, gaze_dir_z]), a list can be empty if a set of columns is missing (gaze_dir is optional)
    # run through all rows
    for _, row in data.iterrows():
        frame_idx = int(row['frame_idx'])  # must cast to int as pd.Series seems to lose typing of dataframe.... :s
        ts = row['timestamp']
        # get all values (None if columns not present)
        # again need to cast to float despite all items in the series being a float, because the dtype of the series is object... :s
        args = tuple(row[c].astype('float').to_numpy() if c else None for c in allCols)
        gazes.append(Gaze(ts, frame_idx, *args))
    return gazes
As noted, the row iteration is prohibitively slow for my use case. Is there a more efficient way to do this? A similar read-in function built on `csv.DictReader` is a little faster, but still far too slow.