I am using Dask to write multiple very large dataframes to a single Parquet dataset in Python. The dataframes are simple and all the column types are either floats or strings. I iterate over the Dask dataframes and call to_parquet on each one, overwriting if it is the first dataframe and appending otherwise:
ddf.to_parquet(options.output_prefix,
               append=n > 0,
               engine="pyarrow",
               write_index=False,
               write_metadata_file=True,
               compute=True)
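A minimal sketch of the whole pattern, with hypothetical two-row data standing in for my real frames (the names ddfs and "output_dataset" are illustrative, not my actual code):

import pandas as pd
import dask.dataframe as dd

# two identical frames with a float column and a pyarrow-backed string column
pdf = pd.DataFrame({
    "x": [1.0, 2.0],
    "s": pd.array(["a", "b"], dtype="string[pyarrow]"),
})
ddfs = [dd.from_pandas(pdf, npartitions=1) for _ in range(2)]

for n, ddf in enumerate(ddfs):
    ddf.to_parquet("output_dataset",
                   append=n > 0,  # first frame writes fresh, the rest append
                   engine="pyarrow",
                   write_index=False,
                   write_metadata_file=True,
                   compute=True)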
On the second call I get the following error:
ValueError: Appended dtypes differ.
{('IJC_SAMPLE_1', 'object'), ('riExonStart_0base', dtype('float64')), ('riExonEnd', 'float64'), ('PValue', dtype('float64')), ('IncLevelDifference', dtype('float64')), ('chr', 'object'), ('SkipFormLen', dtype('float64')), ('ID', 'float64'), ('SJC_SAMPLE_1', 'object'), ('geneSymbol', 'object'), ('ID.1', 'float64'), ('downstreamEE', dtype('float64')), ('downstreamES', 'float64'), ('strand', 'object'), ('SJC_SAMPLE_2',
'object'), ('upstreamES', dtype('float64')), ('IncFormLen', 'float64'), ('FDR', dtype('float64')), ('IJC_SAMPLE_2', 'object'), ('ID.1', dtype('float64')), ('downstreamES', dtype('float64')), ('FDR', 'float64'), ('IJC_SAMPLE_1', string[pyarrow]), ('GeneID', 'object'), ('IncFormLen', dtype('float64')), ('chr', string[pyarrow]), ('GeneID', string[pyarrow]), ('IncLevel1', 'object'), ('SJC_SAMPLE_1', string[pyarrow]), ('geneSymbol', string[pyarrow]), ('riExonStart_0base', 'float64'), ('PValue', 'float64'), ('upstreamEE', dtype('float64')), ('IncLevelDifference', 'float64'), ('IncLevel1', string[pyarrow]), ('strand', string[pyarrow]), ('SJC_SAMPLE_2', string[pyarrow]), ('IncLevel2', 'object'), ('SkipFormLen', 'float64'), ('riExonEnd', dtype('float64')), ('IncLevel2', string[pyarrow]), ('upstreamEE', 'float64'), ('downstreamEE', 'float64'), ('ID', dtype('float64')), ('IJC_SAMPLE_2', string[pyarrow]), ('upstreamES', 'float64')}
The dtypes of the first and second dataframes are the same:
dtypes for frame 0:
ID float64
GeneID string[pyarrow]
geneSymbol string[pyarrow]
chr string[pyarrow]
strand string[pyarrow]
riExonStart_0base float64
riExonEnd float64
upstreamES float64
upstreamEE float64
downstreamES float64
downstreamEE float64
ID.1 float64
IJC_SAMPLE_1 string[pyarrow]
SJC_SAMPLE_1 string[pyarrow]
IJC_SAMPLE_2 string[pyarrow]
SJC_SAMPLE_2 string[pyarrow]
IncFormLen float64
SkipFormLen float64
PValue float64
FDR float64
IncLevel1 string[pyarrow]
IncLevel2 string[pyarrow]
IncLevelDifference float64
dtype: object
dtypes for frame 1:
ID float64
GeneID string[pyarrow]
geneSymbol string[pyarrow]
chr string[pyarrow]
strand string[pyarrow]
riExonStart_0base float64
riExonEnd float64
upstreamES float64
upstreamEE float64
downstreamES float64
downstreamEE float64
ID.1 float64
IJC_SAMPLE_1 string[pyarrow]
SJC_SAMPLE_1 string[pyarrow]
IJC_SAMPLE_2 string[pyarrow]
SJC_SAMPLE_2 string[pyarrow]
IncFormLen float64
SkipFormLen float64
PValue float64
FDR float64
IncLevel1 string[pyarrow]
IncLevel2 string[pyarrow]
IncLevelDifference float64
dtype: object
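Both listings come from printing ddf.dtypes on each frame, and comparing the two Series programmatically confirms they match (a sketch, with ddfs as above):

for n, ddf in enumerate(ddfs):
    print(f"dtypes for frame {n}:")
    print(ddf.dtypes)

# the two dtype Series compare equal
assert ddfs[0].dtypes.equals(ddfs[1].dtypes)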
It seems that the string[pyarrow] columns are being stored as object the first time round, and then on the second time round dask/pyarrow sees these as different dtypes. I've tried various things with the schema parameter to to_parquet, including setting all the string[pyarrow] columns to object or np.object_ or pa.from_numpy_dtype(np.object_) (unsupported), and enforcing that they be stored as "string[pyarrow]" (no effect).