I am trying to implement a custom catalyst expression in Spark, which will parse each column in a dataframe into a string array. The toy example is attached.
case class CustomExpression(child: Expression)
extends UnaryExpression {
override def dataType: DataType = ArrayType(StringType)
override protected def withNewChildInternal(newChild: Expression): Expression =
copy(child = newChild)
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
// Generate the code to produce the desired output
val digits = ctx.freshName("digits")
val value = ctx.freshName("value")
val code =
code"""
String[] $digits = new String[]{"1", "2", "3"};
${ev.value} = $digits;
"""
ev.copy(code = code, isNull = FalseLiteral)
}
and below is how to use it
val df = currentDf.withColumn(colName, new Column(new CustomExpression(col(colName).expr)))
when I want to display df with df.show()
, the compiler give error File ‘generated.java’, Line 32, Column 1: Expression “value_1” is not an rvalue. When I check the generated code using df.explain(“codegen”), it gives the error File ‘generated.java’, Line 56, Column 1: Expression “project_value_0” is not an rvalue. The generateed code snippet is
/* 055 */ String[] project_digits_0 = new String[]{"1", "2", "3"};
/* 056 */ **project_value_0** = project_digits_0;
/* 057 */ String[] project_digits_1 = new String[]{"1", "2", "3"};
/* 058 */ project_value_2 = project_digits_1;
/* 059 */ columnartorow_mutableStateArray_3[1].reset();
/* 060 */
/* 061 */ if (false) {
/* 062 */ columnartorow_mutableStateArray_3[1].setNullAt(0);
/* 063 */ } else {
/* 064 */ // Remember the current cursor so that we can calculate how many bytes are
/* 065 */ // written later.
/* 066 */ final int project_previousCursor_0 = columnartorow_mutableStateArray_3[1].cursor();
/* 067 */
/* 068 */ final ArrayData project_tmpInput_0 = **project_value_0;**
/* 069 */ if (project_tmpInput_0 instanceof UnsafeArrayData) {
/* 070 */ columnartorow_mutableStateArray_3[1].write((UnsafeArrayData) project_tmpInput_0);
/* 071 */ } else {
/* 072 */ final int project_numElements_0 = project_tmpInput_0.numElements();
Acooding to the doGenCode() implementation, project_value_0 is the variable of the expression code value ${ev.value}. How to fix the error?
In the above toy example, I am expecting the output df with one column of a string array.
Wangkai Jin is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.