I am creating a Parquet file containing a field that is a list of doubles. The schema I get is this:
message schema {
  optional binary NAME (UTF8);
  optional group VECTOR_FIELD (LIST) {
    repeated group list {
      required double item;
    }
  }
}
I want this schema:
message schema {
  optional binary NAME (UTF8);
  optional group VECTOR_FIELD (LIST) {
    repeated double array;
  }
}
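For reference, here is a minimal sketch of how the two rows from my example would be shredded into definition and repetition levels for VECTOR_FIELD, assuming the usual Dremel-style rules. As far as I can tell, both schemas above give a maximum definition level of 2 and a maximum repetition level of 1, so the levels should be the same and only the schema nodes differ. ListRow, ShreddedColumn and ShredOptionalDoubleLists are hypothetical names used for illustration; none of them are part of Arrow or Parquet.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper types for illustration; not part of Arrow or Parquet.
struct ListRow {
    bool is_null;               // true if the whole list is null
    std::vector<double> items;  // list contents (ignored when is_null)
};

struct ShreddedColumn {
    std::vector<std::int16_t> def_levels;  // how much of the path is defined
    std::vector<std::int16_t> rep_levels;  // 0 starts a new row, 1 continues a list
    std::vector<double> values;            // only the leaf values actually present
};

// Shred rows of an optional list of required doubles.
// Assumed path: optional VECTOR_FIELD -> repeated node -> required leaf,
// so the maximum definition level is 2 and the maximum repetition level is 1.
ShreddedColumn ShredOptionalDoubleLists(const std::vector<ListRow>& rows) {
    ShreddedColumn out;
    for (const ListRow& row : rows) {
        if (row.is_null) {
            // Null list: nothing on the path is defined.
            out.def_levels.push_back(0);
            out.rep_levels.push_back(0);
        } else if (row.items.empty()) {
            // Empty list: the optional group is defined, but no element exists.
            out.def_levels.push_back(1);
            out.rep_levels.push_back(0);
        } else {
            for (std::size_t i = 0; i < row.items.size(); ++i) {
                out.def_levels.push_back(2);
                out.rep_levels.push_back(i == 0 ? 0 : 1);
                out.values.push_back(row.items[i]);
            }
        }
    }
    return out;
}
```

Under these assumptions, the two four-element vectors from my example produce eight values, definition levels that are all 2, and repetition levels 0, 1, 1, 1, 0, 1, 1, 1.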
I found the following in include/parquet/schema.h in Arrow, but it does not say how to set the ListEncoding level I want when writing out a Parquet file. The ListEncoding enum isn’t mentioned anywhere else in the header files.
// One-level encoding: Only allows required lists with required cells
//   repeated value_type name
//
// Two-level encoding: Enables optional lists with only required cells
//   <required/optional> group list
//     repeated value_type item
//
// Three-level encoding: Enables optional lists with optional cells
//   <required/optional> group bag
//     repeated group list
//       <required/optional> value_type item
//
// 2- and 1-level encoding are respectively equivalent to 3-level encoding with
// the non-repeated nodes set to required.
//
// The "official" encoding recommended in the Parquet spec is the 3-level, and
// we use that as the default when creating list types. For semantic completeness
// we allow the other two. Since all types of encodings will occur "in the
// wild" we need to be able to interpret the associated definition levels in
// the context of the actual encoding used in the file.
//
// NB: Some Parquet writers may not set ConvertedType::LIST on the repeated
// SchemaElement, which could make things challenging if we are trying to infer
// that a sequence of nodes semantically represents an array according to one
// of these encodings (versus a struct containing an array). We should refuse
// the temptation to guess, as they say.
struct ListEncoding {
  enum type { ONE_LEVEL, TWO_LEVEL, THREE_LEVEL };
};
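To make the comment above concrete, here is a small stand-alone sketch of how the maximum definition and repetition levels could be derived for each of the three layouts. It assumes the usual rule that every optional or repeated node on the path adds one definition level and every repeated node adds one repetition level. Repetition, PathNode, MaxLevels and the functions below are hypothetical stand-ins written for this illustration; they are not the types declared in schema.h.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Arrow or Parquet types.
enum class Repetition { REQUIRED, OPTIONAL, REPEATED };

struct PathNode {
    std::string name;
    Repetition repetition;
};

struct MaxLevels {
    int definition;
    int repetition;
};

// Walk the nodes from the top-level field down to the leaf.
// Every OPTIONAL or REPEATED node adds one definition level;
// every REPEATED node also adds one repetition level.
MaxLevels ComputeMaxLevels(const std::vector<PathNode>& path) {
    MaxLevels levels{0, 0};
    for (const PathNode& node : path) {
        if (node.repetition != Repetition::REQUIRED) ++levels.definition;
        if (node.repetition == Repetition::REPEATED) ++levels.repetition;
    }
    return levels;
}

// One-level layout: repeated value_type name
std::vector<PathNode> OneLevelPath() {
    return std::vector<PathNode>{PathNode{"name", Repetition::REPEATED}};
}

// Two-level layout, optional variant: optional group list -> repeated item
std::vector<PathNode> TwoLevelPath() {
    return std::vector<PathNode>{PathNode{"list", Repetition::OPTIONAL},
                                 PathNode{"item", Repetition::REPEATED}};
}

// Three-level layout, optional variants: optional bag -> repeated list -> optional item
std::vector<PathNode> ThreeLevelPath() {
    return std::vector<PathNode>{PathNode{"bag", Repetition::OPTIONAL},
                                 PathNode{"list", Repetition::REPEATED},
                                 PathNode{"item", Repetition::OPTIONAL}};
}
```

With these paths, ComputeMaxLevels gives definition levels of 1, 2 and 3 respectively, and a repetition level of 1 in every case. Setting the non-repeated nodes to REQUIRED in the three-level path reduces it to the same levels as the simpler layouts, which matches the equivalence the comment describes.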
Here’s my example code:
#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

int main()
{
    std::shared_ptr<arrow::MemoryPool> memoryPool = arrow::MemoryPool::CreateDefault();

    // Build the list-of-doubles column: one list per row.
    arrow::ListBuilder values (memoryPool.get(), std::make_shared<arrow::DoubleBuilder>());
    arrow::DoubleBuilder& db_ref = * (static_cast<arrow::DoubleBuilder*> (values.value_builder()));
    std::vector<double> myVector1({1.1, 2.2, 3.0, 4.1});
    std::vector<double> myVector2({100.1, 102.2, 103.0, 104.1});
    values.Append();
    db_ref.AppendValues (myVector1.data(), myVector1.size());
    values.Append();
    db_ref.AppendValues (myVector2.data(), myVector2.size());
    std::shared_ptr<arrow::Array> doubleVectorArray;
    values.Finish(&doubleVectorArray);

    // Build the string column.
    arrow::StringBuilder sb;
    sb.Append("ABC");
    sb.Append("XYZ");
    std::shared_ptr<arrow::Array> stringArray;
    sb.Finish(&stringArray);

    // Describe the two fields: NAME (string) and VECTOR_FIELD (list of required doubles).
    std::vector<std::shared_ptr<arrow::Field>> schema_definition;
    schema_definition.push_back(std::make_shared<arrow::Field> ("NAME", arrow::utf8()));
    std::shared_ptr<arrow::Field> pField = std::make_shared<arrow::Field>("VECTOR_FIELD", arrow::list(arrow::field("item", std::make_shared<arrow::DoubleType>(), false)));
    schema_definition.push_back(pField);

    std::vector<std::shared_ptr<arrow::Array>> parquetArrays;
    parquetArrays.push_back(stringArray);
    parquetArrays.push_back(doubleVectorArray);

    // Assemble the table and write it out as Parquet.
    std::shared_ptr<arrow::Schema> schema = std::make_shared<arrow::Schema> (schema_definition);
    std::shared_ptr<arrow::Table> parquetTable = arrow::Table::Make (schema, parquetArrays);
    std::shared_ptr<arrow::io::FileOutputStream> outFile;
    arrow::io::FileOutputStream::Open ("/tmp/test.parquet", &outFile);
    parquet::arrow::WriteTable (*parquetTable, memoryPool.get(), outFile, 1 << 16);
    return 0;
}