I am reading a parquet file with dask and trying to reshape it into the form I want, but it seems rather difficult (I am quite new to dask, too).
So, I have a parquet file which stores roughly 8M rows, each holding a 384-d embedding vector. I read the file like so:
import dask.dataframe as dd

df = dd.read_parquet(file_name)  # 2 columns: "id" and "embeds"
arr = df.compute()               # raw shape is (8034854, 2)
print(arr["embeds"].shape)
# (8034854,)
Each of those rows holds a 384-d vector, which I checked using this:
df = dd.read_parquet(file_name)
for idx, part in enumerate(df.partitions):
    print("doing partition:", idx)
    arr = part.compute()["embeds"].to_numpy()
    print(arr[0].shape)  # prints: (384,)
    break
What I want is to read my parquet file and simply get back a (8034854, 384) numpy array for further processing.
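For what it's worth, on a plain pandas frame np.stack gives me exactly the shape I'm after (made-up data below, since I can't share the real file); what I don't know is whether this is the right or efficient thing to do on the full 8M-row dask frame:

```python
import numpy as np
import pandas as pd

# Made-up 3-row frame mimicking the structure: an object column whose
# cells are each a length-384 numpy array.
df = pd.DataFrame({
    "id": [0, 1, 2],
    "embeds": [np.arange(384, dtype=np.float32) + i for i in range(3)],
})

# np.stack turns the object column into one (n_rows, 384) array.
mat = np.stack(df["embeds"].to_numpy())
print(mat.shape)  # (3, 384)
```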
I also have a related dask question: when I call df.compute() as above, is the row ordering preserved? I ask because each of those "embeds" belongs to an "id", and I am assuming dask preserves the ordering when I call df.compute()["embeds"], for example.