With Spark 3.5.1, I’m getting the following exception:
24/09/09 16:03:49 WARN TransportChannelHandler: Exception in connection from /10.218.81.147:28102
java.lang.IllegalArgumentException: Too large frame: 1586112597202108408
at org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119)
at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
Here are the details about the environment and scenario:
- Deployment platform: Kubernetes on bare metal (KOB)
- Job types: both streaming and batch jobs
- The Spark version in the Kubernetes cluster is the same as the one used in our app (3.5.1)
- This issue is seen when:
  a. the streaming job is idle, waiting for incoming data;
  b. a batch job has too much data to process; and
  c. mid-way through processing (sometimes the error scrolls by like any other log line, and sometimes the job gets stuck).
My belief is that when Spark tries to move a DataFrame from one executor to another, or to the driver, the allowed network frame size falls short and the connection gets severed (correct me if I'm wrong). One thing that puzzles me: the reported frame size of 1586112597202108408 bytes is roughly 1.4 EiB, far too large to be any real payload, so the decoder may be misreading the first bytes of the stream.
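As a sanity check on that last point, the bogus length can be decoded back into raw bytes. This is a minimal, self-contained sketch, not part of our job; it only assumes that Spark's TransportFrameDecoder reads the first 8 bytes on the wire as a big-endian long, which is what the stack trace suggests:

```scala
import java.nio.ByteBuffer

// The bogus "frame length" from the exception message.
val frame = 1586112597202108408L

// Reverse what the decoder did: turn the big-endian long back into
// the 8 bytes that actually arrived on the wire.
val bytes = ByteBuffer.allocate(8).putLong(frame).array()
println(bytes.map(b => f"${b & 0xff}%02x").mkString(" "))
// prints: 16 03 01 00 ea 00 ff f8
```

If I decoded that correctly, the leading bytes 16 03 01 are the header of a TLS handshake record (a ClientHello), which would suggest something is speaking TLS to a plaintext Spark port (a probe, a misconfigured client, or mismatched spark.ssl.* settings) rather than Spark sending a genuinely oversized frame.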
Things we tried:
- shuffle partitions = 3000 and shuffle parallelism = 3000
- Raised the Spark network frame size for both driver and executor from the default 128m up to 1024m (exact settings are in the sketch below)
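For reference, this is roughly how those knobs were applied (a minimal sketch; the config keys shown are the standard Spark ones I believe correspond to the settings above, not copied verbatim from our deployment):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("streaming-and-batch-job")              // hypothetical app name
  .config("spark.sql.shuffle.partitions", "3000")  // shuffle partition count
  .config("spark.default.parallelism", "3000")     // default RDD parallelism
  .config("spark.rpc.message.maxSize", "1024")     // in MiB; default is 128
  .getOrCreate()
```

Note that spark.rpc.message.maxSize only caps control-plane RPC messages (e.g., map output status sent to the driver), so raising it may not affect the path that throws in TransportFrameDecoder.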
I’ve searched the web for solutions, but nothing has worked. Any thoughts on how to fix this?