I am using Dask to write multiple very large dataframes to a single Parquet dataset in Python. The dataframes are simple and all the column types are either floats or strings. I iterate over the Dask dataframes and call to_parquet on each one, overwriting if it is the first dataframe and appending otherwise:
ddf.to_parquet(options.output_prefix,
               append=n > 0,
               engine="pyarrow",
               write_index=False,
               write_metadata_file=True,
               compute=True)
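A minimal sketch of the whole pattern, with hypothetical two-row data standing in for my real frames (the names ddfs and "output_dataset" are illustrative, not my actual code):

import pandas as pd
import dask.dataframe as dd

# two identical frames with a float column and a pyarrow-backed string column
pdf = pd.DataFrame({
    "x": [1.0, 2.0],
    "s": pd.array(["a", "b"], dtype="string[pyarrow]"),
})
ddfs = [dd.from_pandas(pdf, npartitions=1) for _ in range(2)]

for n, ddf in enumerate(ddfs):
    ddf.to_parquet("output_dataset",
                   append=n > 0,  # first frame writes fresh, the rest append
                   engine="pyarrow",
                   write_index=False,
                   write_metadata_file=True,
                   compute=True)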
On the second call I get the following error:
ValueError: Appended dtypes differ.
{('IJC_SAMPLE_1', 'object'), ('riExonStart_0base', dtype('float64')), ('riExonEnd', 'float64'), ('PValue', dtype('float64')), ('IncLevelDifference', dtype('float64')), ('chr', 'object'), ('SkipFormLen', dtype('float64')), ('ID', 'float64'), ('SJC_SAMPLE_1', 'object'), ('geneSymbol', 'object'), ('ID.1', 'float64'), ('downstreamEE', dtype('float64')), ('downstreamES', 'float64'), ('strand', 'object'), ('SJC_SAMPLE_2',
'object'), ('upstreamES', dtype('float64')), ('IncFormLen', 'float64'), ('FDR', dtype('float64')), ('IJC_SAMPLE_2', 'object'), ('ID.1', dtype('float64')), ('downstreamES', dtype('float64')), ('FDR', 'float64'), ('IJC_SAMPLE_1', string[pyarrow]), ('GeneID', 'object'), ('IncFormLen', dtype('float64')), ('chr', string[pyarrow]), ('GeneID', string[pyarrow]), ('IncLevel1', 'object'), ('SJC_SAMPLE_1', string[pyarrow]), ('geneSymbol', string[pyarrow]), ('riExonStart_0base', 'float64'), ('PValue', 'float64'), ('upstreamEE', dtype('float64')), ('IncLevelDifference', 'float64'), ('IncLevel1', string[pyarrow]), ('strand', string[pyarrow]), ('SJC_SAMPLE_2', string[pyarrow]), ('IncLevel2', 'object'), ('SkipFormLen', 'float64'), ('riExonEnd', dtype('float64')), ('IncLevel2', string[pyarrow]), ('upstreamEE', 'float64'), ('downstreamEE', 'float64'), ('ID', dtype('float64')), ('IJC_SAMPLE_2', string[pyarrow]), ('upstreamES', 'float64')}
The dtypes of the first and second dataframes are the same:
dtypes for frame 0:
ID float64
GeneID string[pyarrow]
geneSymbol string[pyarrow]
chr string[pyarrow]
strand string[pyarrow]
riExonStart_0base float64
riExonEnd float64
upstreamES float64
upstreamEE float64
downstreamES float64
downstreamEE float64
ID.1 float64
IJC_SAMPLE_1 string[pyarrow]
SJC_SAMPLE_1 string[pyarrow]
IJC_SAMPLE_2 string[pyarrow]
SJC_SAMPLE_2 string[pyarrow]
IncFormLen float64
SkipFormLen float64
PValue float64
FDR float64
IncLevel1 string[pyarrow]
IncLevel2 string[pyarrow]
IncLevelDifference float64
dtype: object
dtypes for frame 1:
ID float64
GeneID string[pyarrow]
geneSymbol string[pyarrow]
chr string[pyarrow]
strand string[pyarrow]
riExonStart_0base float64
riExonEnd float64
upstreamES float64
upstreamEE float64
downstreamES float64
downstreamEE float64
ID.1 float64
IJC_SAMPLE_1 string[pyarrow]
SJC_SAMPLE_1 string[pyarrow]
IJC_SAMPLE_2 string[pyarrow]
SJC_SAMPLE_2 string[pyarrow]
IncFormLen float64
SkipFormLen float64
PValue float64
FDR float64
IncLevel1 string[pyarrow]
IncLevel2 string[pyarrow]
IncLevelDifference float64
dtype: object
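Both listings come from printing ddf.dtypes on each frame, and comparing the two Series programmatically confirms they match (a sketch, with ddfs as above):

for n, ddf in enumerate(ddfs):
    print(f"dtypes for frame {n}:")
    print(ddf.dtypes)

# the two dtype Series compare equal
assert ddfs[0].dtypes.equals(ddfs[1].dtypes)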
It seems that the string[pyarrow] columns are being stored as object the first time round, and then on the second time round dask/pyarrow sees these as different dtypes. I've tried various things with the schema parameter to to_parquet, including setting all the string[pyarrow] columns to object or np.object_ or pa.from_numpy_dtype(np.object_) (unsupported), and enforcing that they be stored as "string[pyarrow]" (no effect).